sed using regex example - regex

I'm going over some legacy code and found this code:
cat some_file | \
sed "/^\/${CATEGORY}\/latest\//s: /.*$: ${DATA_PATH}:"
The format of the original file looks like:
/car/latest/ /US/car/2017/04/02
/bike/latest/ /US/bike/2017/03/31
/boat/latest/ /US/boat/2017/04/03
Assume the CATEGORY above is bike, and the DATA_PATH is /US/bike/2017/04/02, I guess the output will be like this, otherwise it does not make any sense.
/car/latest/ /US/car/2017/04/02
/bike/latest/ /US/bike/2017/04/02
/boat/latest/ /US/boat/2017/04/03
If so, what does the "s: /.*$:" do here? Why doesn't "/boat/latest/ /US/boat/2017/04/03" get substituted since we are replacing to the end (using the dollar sign).
If not, then what will be the output?
Thanks!

As the sed part is the issue, let us break it down:
/^\/${CATEGORY}\/latest\// -- This first part is an address: it selects only the lines that match the pattern. Assuming CATEGORY = bike, the pattern is ^\/bike\/latest\/ (the slashes are escaped because / is also the address delimiter). Note that ^ means the line must start with /bike/latest/, which is why the car and boat lines are left untouched.
s: /.*$: ${DATA_PATH}: -- This substitution is performed only on the lines selected above. First, note that the "normal" / delimiter has been replaced by :. Read closely, the pattern says: match a space followed by / and then all characters until the end of the line. The space is the key, because the only place on each line where a space is followed by / is the start of the second column, namely " /US/bike/2017/03/31" in our bike example. The replacement is likewise a space followed by DATA_PATH.
if we take a single line of our data (where we have bike), the matching portion is:
/bike/latest/ /US/bike/2017/03/31
             ^^^^^^^^^^^^^^^^^^^^
Note how the first ^ is prior to the / in front of US
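To make the two steps concrete, here is a minimal Python sketch of the same logic (an address that selects lines, then a substitution applied only to those lines). The function name update_latest and the sample values are made up for illustration; it is an approximation of the sed command, not a replacement for it.

```python
import re

def update_latest(lines, category, data_path):
    # Counterpart of the sed address /^\/${CATEGORY}\/latest\//
    address = re.compile(r'^/' + re.escape(category) + r'/latest/')
    out = []
    for line in lines:
        if address.search(line):
            # Counterpart of s: /.*$: ${DATA_PATH}:
            # (space + slash + rest of line becomes space + data_path)
            line = re.sub(r' /.*$', lambda m: ' ' + data_path, line, count=1)
        out.append(line)
    return out

lines = [
    '/car/latest/ /US/car/2017/04/02',
    '/bike/latest/ /US/bike/2017/03/31',
    '/boat/latest/ /US/boat/2017/04/03',
]
for line in update_latest(lines, 'bike', '/US/bike/2017/04/02'):
    print(line)
```

Only the bike line changes; the car and boat lines never reach the substitution because the address does not select them.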

The address will match the line starting with /bike/latest/ in your example. The " /.*$" pattern in the substitution matches a space followed by a slash followed by any characters up to the end of the line. If DATA_PATH is the same as what is being replaced, then this effectively does nothing. Try setting DATA_PATH to something else and you can see the substitution.
Just to clarify, the substitution replaces everything from a slash that is preceded by a space. There are no spaces before any of the category paths, e.g. /bike/latest/, so they are never part of the match.

Related

Match character placeholders (places where an input cursor can be put)

I work with Visual Studio Code and I have a problem with a 1,000 lines long .md document in which generally each line contains one or more sentence.
I desire to wrap each sentence with vertical bars (one from the left and one from the right, with respective empty spaces), for the process of transforming the long list of sentences into a (single columned) markdown table.
Current input
sentence
Desired output
| Sentence |
or:
| Sentence. Sentence |
and so on...
How I thought to do it
In general, I can put my input cursor (I-beam cursor) anywhere beside characters in a text field;
I assume that any such "place" (where I can put my input cursor), is plausible to be named a "Character Placeholder" (CP).
I assume that CPs are created per characters (for example, a line with only one character would contain two CPs) and if so, one could freely match CP1 and CP2 (or CP0 and CP1 - depends on base index), before and after that character respectively.
I would like to command VSCODE to add a vertical bar and a respective empty space (|U+0020) in the CP available before the first character in every line, as well as in the CP available after the last character in every line (U+0020|) .
My question
As I only know ways to match characters (or sets of characters) themselves, with regex, but I don't know how to match CPs only, I ask:
How could one match CPs if at all, with current technology, so to command a program to add data X in CP Y?
This is simple to do with regex. Regex has anchors for the start and the end of a string, and in VS Code's find and replace (with regex mode enabled) they match at the start and end of every line. They match a position, not a character, which is exactly the "place" you describe.
To match the start of a line the regex is ^, while to match the end of a line the regex is $.
Now to implement your request, all you need to do is match the whole line and wrap it:
Search ^(.*)$ and replace with | $1 | (the $1 is a back reference to the captured group). Use literal spaces in the replacement: \s is a character class for the search pattern and is not a space when used in the replacement.
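For comparison, the same search and replace can be sketched in Python. This is only an illustration of the anchors. The function name wrap_lines is made up, and the pattern uses .+ instead of .* so that empty lines are skipped, which is an assumption about what you want.

```python
import re

def wrap_lines(text):
    # With re.MULTILINE, ^ and $ match at the start and end of every line.
    # \1 in the replacement is the captured line content.
    return re.sub(r'^(.+)$', r'| \1 |', text, flags=re.MULTILINE)

print(wrap_lines('Sentence\nSentence. Sentence'))
```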

Escaping invalid markdown using python regex

I've been trying to write some python to escape 'invalid' markdown strings.
This is for use with a python library (python-telegram-bot) which requires unused markdown characters to be escaped with a \.
My aim is to match lone *,_,` characters, as well as invalid hyperlinks - eg, if no link is provided, and escape them.
An example of what I'm looking for is:
*hello* is fine and should not be changed, whereas hello* would become hello\*. On top of that, if values are nested, they should not be escaped - eg _hello*_ should remain unchanged.
My thought was to match all the doubles first, and then replace any leftover lonely characters. I managed a rough version of this using re.finditer():
def parser(txt):
    match_md = r'(\*)(.+?)(\*)|(\_)(.+?)(\_)|(`)(.+?)(`)|(\[.+?\])(\(.+?\))|(?P<astx>\*)|(?P<bctck>`)|(?P<undes>_)|(?P<sqbrkt>\[)'
    for e in re.finditer(match_md, txt):
        if e.group('astx') or e.group('bctck') or e.group('undes') or e.group('sqbrkt'):
            txt = txt[:e.start()] + '\\' + txt[e.start():]
    return txt
note: regex was written to match *text*, _text_, `text`, [text](url), and then single *, _, `, [, knowing the last groups
But the issue here, is of course that the offset changes as you insert more characters, so everything shifts away. Surely there's a better way to do this than adding an offset counter?
I tried to use re.sub(), but I haven't been able to find how to replace a specific group, or had any luck with (?:) to 'not match' the valid markdown.
This was my re.sub attempt:
def test(txt):
    match_md = r'(?:(\*)(.+?)(\*))|' \
               '(?:(\_)(.+?)(\_))|' \
               '(?:(`)(.+?)(`))|' \
               '(?:(\[.+?\])(\(.+?\)))|' \
               '(\*)|' \
               '(`)|' \
               '(_)|' \
               '(\[)'
    return re.sub(match_md, "\\\\\g<0>", txt)
This just prefixed every match with a backslash (which was expected, but I'd hoped the ?: would stop them being matched.)
A bonus would be if \'s already in the string were escaped too, so that they wouldn't interfere with the markdown present - this could be a source of error, as the library would see the following character as escaped, causing it to see the rest as invalid.
Thanks in advance!
You are probably looking for a regular expression like this:
def test(txt):
    match_md = r'((([_*]).+?\3[^_*]*)*)([_*])'
    return re.sub(match_md, "\g<1>\\\\\g<4>", txt)
Note that for clarity I just made up a sample for * and _. You can expand the list in the [] brackets easily. Now let's take a look at this thing.
The idea is to crunch through strings that look like *foo_* or _bar*_ followed by text that doesn't contain any specials. The regex that matches such a string is ([_*]).+?\1[^_*]*: We match an opening delimiter, save it in \1, and go further along the line until we see the same delimiter (now closing). Then we eat anything behind that that doesn't contain any delimiters.
Now we want to do that as long as no more delimited strings remain, that's done with (([_*]).+?\2[^_*]*)*. What's left on the right side now, if anything, is an isolated special, and that's what we need to mask. After the match we have the following sub matches:
g<0> : the whole match
g<1> : submatch of ((([_*]).+?\3[^_*]*)*)
g<2> : submatch of (([_*]).+?\3[^_*]*)
g<3> : submatch of ([_*]) (hence the \3 above)
g<4> : submatch of ([_*]) (the one to mask)
What's left to you now is to find a way how to treat the invalid hyperlinks, that's another topic.
Update:
Unfortunately this solution masks out valid markdown such as *hello* (=> \*hello\*). The work around to fix this would be to add a special char to the end of line and remove the masked special char once the substitution is done. OP might be looking for a better solution.
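One way to sidestep both the shifting offsets and the problem of replacing a specific group is to pass a function to re.sub: the function sees each match and decides what to return. The sketch below is only that, a sketch. The name escape_lone and the simplified pattern are assumptions, and it does not handle backslashes that are already in the string.

```python
import re

MD = re.compile(
    r'\*.+?\*|_.+?_|`.+?`|\[.+?\]\(.+?\)'  # valid constructs, returned unchanged
    r'|(?P<lone>[*_`\[])'                  # a leftover lone special character
)

def escape_lone(txt):
    def repl(m):
        if m.group('lone'):
            # Only the lone character gets a backslash in front of it.
            return '\\' + m.group('lone')
        return m.group(0)
    return MD.sub(repl, txt)

print(escape_lone('*hello* is fine, hello* is not'))
```

Because re.sub builds the result string itself, no offset counter is needed, and valid pairs such as *hello* or _hello*_ are consumed as a whole before the lone-character branch can see them.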

How can I use vim to substitute all whole lines that match a Regex

I have a tex file which contains several paragraphs like:
\paragraph{name1}
...
\paragraph{name2}
...
Now I want to substitute all the "paragraph" with item, just like:
\item
...
\item
...
To achieve that I tried many commands, and finally I used this:
(note that I used "a:" to "z:" as paragraph names)
:% s/\\paragraph[{][a-z]:[}]/\\item/g
and I think that is neither pretty nor efficient. I have tried to match the whole line containing "paragraph", but somehow only this word is replaced. Given that I can delete all such lines with
:% g/_*paragraph_*/d
is there a better way to perform a substitution in the same way? (In other words, to substitute every whole line that contains a specific word.)
Your first attempt was almost correct. Rather than this
:% s/\\paragraph[{][a-z]:[}]/\\item/g
Use this
:% s/^\\paragraph{[a-z0-9:]\+}$/\\item/g
Let's break it down piece by piece:
The ^ character matches the start of the line, so that you don't match something like this:
Some text \paragraph{abc}
The reason why we use \\ instead of \ is because \ is an escape character, so to match it, we escape the escape character.
Doing [a-z0-9:]\+ will match one or more characters from a-z, 0-9 or :, which covers paragraph names like the "a:" to "z:" in your question. No | is needed inside [] (there it would be matched as a literal |). If you need capital letters, you could do something like [a-zA-Z0-9:]\+.
Finally, we anchor the expression to the end of the line with $, so that it does not match lines that don't fit this pattern exactly.
An easy way to do this is with a macro!
First, search for the pattern using /, like /\\paragraph (the backslash must be doubled to match a literal \).
Let's start the macro. Clear register a by pressing qaq.
Press qa to start recording in register a.
Press n to go to the next occurrence. Then, press c$ to delete till the end of the line and enter insert mode. Then, type the replacement text and press the Escape key.
Press @a to repeat the process. End the macro by pressing q.
Now the macro is recorded, and you can press @a once to make the change in all such lines.
You can do this:
:%s/\\paragraph{[^{}]*}/\\item/g
This finds all occurrences of \paragraph{, followed by 0 or more non-{} characters, followed by } (i.e. something like \paragraph{stuff here}), and replaces them by \item.
Or if you want to replace all lines containing paragraph:
:%s/^.*paragraph.*$/\\item/
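If it helps to see the two substitutions outside vim, here is a rough Python equivalent. The function names are made up, and Python's regex dialect differs from vim's (for example { must be escaped here), so treat it as an illustration of the patterns only.

```python
import re

def paragraphs_to_items(tex):
    # Counterpart of :%s/\\paragraph{[^{}]*}/\\item/g
    return re.sub(r'\\paragraph\{[^{}]*\}', lambda m: r'\item', tex)

def lines_with_word_to_item(tex, word='paragraph'):
    # Counterpart of :%s/^.*paragraph.*$/\\item/ (replaces the whole line)
    pattern = r'^.*' + re.escape(word) + r'.*$'
    return re.sub(pattern, lambda m: r'\item', tex, flags=re.MULTILINE)

sample = '\\paragraph{a:}\ntext\n\\paragraph{b:}\nmore'
print(paragraphs_to_items(sample))
```

The first function replaces only the \paragraph{...} part, the second replaces any line that contains the word, whatever else is on it.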

vi: :s how to replace only the second occurrence on a line?

:s/u/X/2 - this replaces the first u to X on the current and next line...
or to replace the second character on a line with X???? IDK.
or perhaps its something other than :s?
I suspect I have to use grouping of some kind (\2?) but I don't know to write that.
I heard that sed's s command and the :s command in vi are alike, and on a help page for sed I found:
3.1.3. Substitution switches:
Standard versions of sed support 4 main flags or switches which may be added to
the end of an "s///" command. They are:
N - Replace the Nth match of the pattern on the LHS, where
N is an integer between 1 and 512. If N is omitted,
the default is to replace the first match only.
g - Global replace of all matches to the pattern.
p - Print the results to stdout, even if -n switch is used.
w file - Write the pattern space to 'file' if a replacement was
done. If the file already exists when the script is
executed, it is overwritten. During script execution,
w appends to the file for each match.
http://sed.sourceforge.net/sedfaq3.html#s3.1.3
so: :r! sed 's/u/X/2' would work, although I think there is a specifically vi way of doing this?
IDK if its relevant but I'm using the tcsh shell.
also,
:version:
Version 1.79 (10/23/96) The CSRG, University of California, Berkeley.
This is brittle, but may be enough to do what you want. This substitute command with a regex:
:%s/first\(.\{-}\)first/first\1second/g
converts this:
first and first then first again
first and first then first again
first and first then first again
first and first then first again
to this:
first and second then first again
first and second then first again
first and second then first again
first and second then first again
The regexp looks for the first "first", followed by a match of any characters using pattern .\{-}, which is the non-greedy version of .* (type :help non-greedy in vim for more info.) This non-greedy match is followed with the second "first".
The characters between the first and second "first" are captured by surrounding the .\{-} with parentheses, which, with escaping, results in \(.\{-}\). That captured group is then referenced with \1 (1 means the first captured group) in the replacement.
In order to substitute the second occurrence on a line, you can say:
:call feedkeys('nyq') | s/u/X/gc
In order to invoke it over a range of lines or the entire file, use it in a function:
:function Mysub()
: call feedkeys('nyq') | s/u/X/gc
:endfunction
For example, the following would substitute the second occurrence of u for X in every line in the file:
:1,$ call Mysub()
Here's a dumber but easier to understand way: first find a string that doesn't exist in the file - for the sake of argument assume it's zzz. then simply:
:%s/first/zzz
:%s/first/second
:%s/zzz/first
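The underlying idea (count the matches on a line and change only the Nth) can also be sketched in Python with a replacement function. The helper name replace_nth is hypothetical and this is not a vi feature, just an illustration of what sed's numeric flag does.

```python
import re

def replace_nth(line, pattern, repl, n=2):
    count = 0

    def sub(m):
        nonlocal count
        count += 1
        # Every match is visited, but only the n-th one is changed.
        return repl if count == n else m.group(0)

    return re.sub(pattern, sub, line)

print(replace_nth('first and first then first again', 'first', 'second'))
```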

Regex: Match any character (including whitespace) except a comma

I would like to match any character and any whitespace except comma with regex. Only matching any character except comma gives me:
[^,]*
but I also want to match any whitespace characters, tabs, space, newline, etc. anywhere in the string.
EDIT:
This is using sed in vim via :%s/foo/bar/gc.
I want to find starting from func up until the comma, in the following example:
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
To work across multiple lines in sed using regex, you should look here.
EDIT:
In sed, working with newlines is a bit different. sed supports three commands for multiline operations: N, P and D. To see how they work, see this explanation (Working with Multiple Lines), where these three commands are discussed.
My guess is that the N command is the piece that is missing here. Adding N appends the next line to the pattern space, which allows the pattern to see \n in the string.
An example from here:
Occasionally one wishes to use a new line character in a sed script.
Well, this has some subtle issues here. If one wants to search for a
new line, one has to use "\n." Here is an example where you search for
a phrase, and delete the new line character after that phrase -
joining two lines together.
(echo a;echo x;echo y) | sed '/x$/ {
N
s:x\n:x:
}'
which generates
a
xy
However, if you are inserting a new line, don't use "\n" - instead
insert a literal new line character:
(echo a;echo x;echo y) | sed 's:x:X\
:'
generates
a
X

y
So basically you're trying to match a pattern over multiple lines.
Here's one way to do it in sed (pretty sure these are not useable within vim though, and I don't know how to replicate this within vim)
sed '
/func/{
:loop
/,/! {N; b loop}
s/[^,]*/func("ok"/
}
' inputfile
Let's say inputfile contains these lines
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
The output is
func("ok", "more strings")
Details:
If a line contains func, enter the braces.
:loop is a label named loop
If the line does not contain , (that's what /,/! means)
append the next line to pattern space (N)
branch to / go to loop label (b loop)
So it will keep on appending lines and looping until , is found, upon which the s command is run which matches all characters before the first comma against the (multi-line) pattern space, and performs a replacement.
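The same loop can be sketched in Python, which may make the control flow easier to follow. The function name replace_until_comma is made up, and the behaviour when the input ends before a comma is found is an assumption (the joined lines are left unchanged). Note that [^,]* already matches newlines here, because a negated class matches anything that is not listed.

```python
import re

def replace_until_comma(lines, replacement='func("ok"'):
    out = []
    it = iter(lines)
    for line in it:
        if 'func' in line:
            # Like N + "b loop": keep appending lines until a comma shows up.
            while ',' not in line:
                nxt = next(it, None)
                if nxt is None:
                    break  # input ended without a comma
                line += '\n' + nxt
            if ',' in line:
                # Like s/[^,]*/.../: replace everything before the first comma.
                line = re.sub(r'[^,]*', lambda m: replacement, line, count=1)
        out.append(line)
    return out

lines = [
    'func("bla bla bla"',
    '"asdfasdfasdfasdfasdfasdf"',
    '"asdfasdfasdf", "more strings")',
]
print('\n'.join(replace_until_comma(lines)))
```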