Match character placeholders (places where an input cursor can be put) - regex

I work with Visual Studio Code and I have a problem with a 1,000-line .md document in which each line generally contains one or more sentences.
I want to wrap the content of each line with vertical bars (one on the left and one on the right, each with a separating space), in order to transform the long list of sentences into a single-column Markdown table.
Current input
sentence
Desired output
| Sentence |
or:
| Sentence. Sentence |
and so on...
How I thought to do it
In general, I can put my input cursor (I-beam cursor) anywhere beside characters in a text field;
I assume that any such "place" (where I can put my input cursor), is plausible to be named a "Character Placeholder" (CP).
I assume that CPs are created per characters (for example, a line with only one character would contain two CPs) and if so, one could freely match CP1 and CP2 (or CP0 and CP1 - depends on base index), before and after that character respectively.
I would like to tell VS Code to add a vertical bar and a following space (|U+0020) in the CP before the first character of every line, as well as a space and a vertical bar (U+0020|) in the CP after the last character of every line.
My question
As I only know ways to match characters (or sets of characters) themselves, with regex, but I don't know how to match CPs only, I ask:
How could one match CPs, if at all, with current technology, so as to command a program to add data X in CP Y?

This is simple to do with regex. Regex has anchors for the start and the end of a string, and these match positions rather than characters (depending on your input, you can treat each line as its own string).
To match the start of a line the regex is ^, and to match the end of a line it is $.
To implement your request, all you need to do is match the whole line with ^(.*)$ and replace it with | $1 | (the $1 is a back reference to the captured group, and the spaces are typed literally, because \s is a character class for searching and is not a replacement escape). It would look something like this:
Search: ^(.*)$
Replace: | $1 |
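The same idea can be tried outside the editor. The following is a minimal Python sketch, not VS Code's own engine: the function names and the sample text are made up for illustration. It shows both the capture-and-replace approach and inserting directly at the zero-width ^ and $ positions, which is what the question calls "character placeholders".

```python
import re

def wrap_lines(text: str) -> str:
    """Wrap every line in '| ... |' by capturing the whole line."""
    # re.MULTILINE makes ^ and $ match at the start and end of each line.
    return re.sub(r"^(.*)$", r"| \1 |", text, flags=re.MULTILINE)

def wrap_lines_by_position(text: str) -> str:
    """Same result, but inserting at the zero-width positions themselves."""
    # ^ and $ consume no characters, so sub() inserts without deleting.
    text = re.sub(r"^", "| ", text, flags=re.MULTILINE)
    return re.sub(r"$", " |", text, flags=re.MULTILINE)

# Hypothetical sample input (no trailing newline).
sample = "First sentence.\nSecond sentence. Third sentence."
print(wrap_lines(sample))
print(wrap_lines_by_position(sample))
```

Both functions assume the text does not end with a newline; if it does, Python also finds a position after that final newline and wraps an empty line there.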


Using Gsub to get matched strings in R - regular expression

I am trying to extract words after the first space using
species<-gsub(".* ([A-Za-z]+)", "\\1", x=genus)
This works fine for the other rows that have two words; however, row [9] "Eulamprus tympanum marnieae" has 3 words and my code only returns the last word in the string, "marnieae". How can I extract the words after the first space so that I retrieve "tympanum marnieae" instead of "marnieae", with the answers stored in one variable called species?
genus
[9] "Eulamprus tympanum marnieae"
Your original pattern didn't work because the subpattern [A-Za-z]+ doesn't match spaces, and therefore will only match a single word.
You can use the following pattern to match any number of words (other than 0) after the first, within double quotes:
"[A-Za-z]+ ([A-Za-z ]+)"
https://regex101.com/r/p6ET3I/1
https://regex101.com/r/p6ET3I/2
This is a relatively simple, but imperfect, solution. It will also match trailing spaces, or just two or more spaces after the first word even if a second word doesn't exist. For example, "Eulamprus" followed by six spaces will successfully match the pattern and return 5 spaces. You should only use this pattern if you trust your data to be properly formatted.
A more reliable approach would be the following:
"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)"
https://regex101.com/r/p6ET3I/3
This pattern will capture one word (following the first), followed by any number of additional words (including 0), separated by spaces.
However, from what I remember from biology class, species are only ever comprised of one or two names, and never capitalized. The following pattern will reflect this format:
"[A-Za-z]+ ([a-z]+(?: [a-z]+)?)"
https://regex101.com/r/p6ET3I/4
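To see the difference between the original attempt and the more reliable pattern, here is a small Python sketch (Python rather than R, purely for illustration). The first sample value and the helper name species_of are assumptions; only "Eulamprus tympanum marnieae" comes from the question. The surrounding double quotes are left out of the pattern because the sample strings do not contain them.

```python
import re

# Hypothetical sample values; only the last one comes from the question.
genus = ["Egernia striolata", "Eulamprus tympanum marnieae"]

# Original approach: the greedy ".* " consumes everything up to the LAST space,
# so only the final word is left for the capture group.
last_word = [re.sub(r".* ([A-Za-z]+)", r"\1", g) for g in genus]

# More reliable pattern from the answer: one word, then zero or more
# further words, each preceded by a single space.
AFTER_FIRST = re.compile(r"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)")

def species_of(name: str) -> str:
    """Return everything after the first word, or '' if the format is off."""
    m = AFTER_FIRST.fullmatch(name)
    return m.group(1) if m else ""

species = [species_of(g) for g in genus]
print(last_word)
print(species)
```

Using fullmatch means a value with trailing spaces or a missing second word yields an empty string instead of a partial match.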

Using a regex to extract a set of numbers and/or blank lines

I'm constructing a regex using PCRE to process text to extract a set of numbers from a set of text lines (the lines are produced by parsing HTML with XPATH but the question doesn't depend on that). If the number required isn't present, I need to return a blank line.
I'm using a module in Drupal called Feeds Tamper that provides a limited set of options to modify the content -- including a Regex find and replace based on PCRE (not PCRE2). I have options to do a sequence of Regex Find and Replace and/or simple Find and Replace.
The input takes the format:
Text A Location1 More text q=1,2)" Even more text
Text B
Text C Location1 More text q=3,4)" Even more text
Text D
There can be any number of lines including and not including the digits I want to extract; the last line may or may not have a digit in it; I need to process all the lines and end up with one result per line and no extras. The results are then replaced with a capturing group.
My search Regex currently looks like
.*?Location1.*?q=(.*?),(.*?)".*?(\r|$)|.*?(\r|$)
and my replacement like
\1|
but (see regex101.com) this gives results such as
1||
||
3||
||
||
where the expected output is:
1|
|
3|
|
i.e. there is an extra line at the end that doesn't correspond to an input line, and an extra pipe character at the end of each line.
If I use
.*?Location1.*?q=(.*?),(.*?)".*?\r|.*?\r
the last line is omitted so I get:
1|
|
3|
If I don't add a pipe | to the end of the substitution I get the right number of lines with the expected content (digit or blank), but as soon as I add something at the end of the substitution I get an extra line and the substituted character is doubled.
What do I need to change in my Regex and why?
Something like this:
^(?:.*Location1.*?q=(\d+),(\d+))?.*$
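First it matches the start of a line (with the multiline flag, so that ^ and $ apply per line), optionally followed by the "required" Location1 and q= parts, and captures the numbers. Finally it matches anything up to the end of the line. The likely cause of the doubled pipe in your version is that the second alternative, .*?(\r|$), can also match the empty string, so an extra zero-width match is found at the end of each line; anchoring with ^ means each line is matched exactly once.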
Here at regex101.
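As a rough check outside Feeds Tamper, the pattern can be exercised with Python's re module, which is not PCRE but behaves the same for this expression. The function name first_numbers is made up, and the sample assumes plain \n line endings with no trailing newline.

```python
import re

# Pattern from the answer; re.MULTILINE lets ^ and $ work per line.
LINE = re.compile(r"^(?:.*Location1.*?q=(\d+),(\d+))?.*$", re.MULTILINE)

def first_numbers(text: str) -> str:
    """Replace each line with its first captured number (or nothing) plus '|'."""
    # group(1) is None when the optional part did not match, hence "or ''".
    return LINE.sub(lambda m: (m.group(1) or "") + "|", text)

# Sample lines taken from the question.
sample = (
    'Text A Location1 More text q=1,2)" Even more text\n'
    "Text B\n"
    'Text C Location1 More text q=3,4)" Even more text\n'
    "Text D"
)
print(first_numbers(sample))
```

Each input line produces exactly one output line, with a single pipe, because the leading ^ prevents a second, empty match at the end of a line.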

sed using regex example

I'm going over some legacy code and found this code:
cat some_file | \
sed "/^\/${CATEGORY}\/latest\//s: /.*$: ${DATA_PATH}:"
The format of the original file looks like:
/car/latest/ /US/car/2017/04/02
/bike/latest/ /US/bike/2017/03/31
/boat/latest/ /US/boat/2017/04/03
Assume the CATEGORY above is bike, and the DATA_PATH is /US/bike/2017/04/02, I guess the output will be like this, otherwise it does not make any sense.
/car/latest/ /US/car/2017/04/02
/bike/latest/ /US/bike/2017/04/02
/boat/latest/ /US/boat/2017/04/03
If so, what does the "s: /.*$:" do here? Why doesn't "/boat/latest/ /US/boat/2017/04/03" get substituted since we are replacing to the end (using the dollar sign).
If not, then what will be the output?
Thanks!
As the sed part is the issue, let us break it down:
/^\/${CATEGORY}\/latest\// -- This first part is an address: it selects the lines that match the pattern. Assuming CATEGORY = bike, the pattern is ^/bike/latest/. Note that ^ means the line must start with this.
s: /.*$: ${DATA_PATH}: -- Once a line matching the above is found, this substitution is performed on it. First, note that the "normal" / delimiter has been replaced by :. If you look closely, it reads: match a space followed by / and then all characters until the end of the line. The space is the key, as the only place on each line where a space is followed by / is at the start of the second column, namely " /US/bike/2017/03/31" in our bike example. The replacement portion is also a space followed by DATA_PATH.
If we take a single line of our data (where we have bike), the matching portion is:
/bike/latest/ /US/bike/2017/03/31
             ^^^^^^^^^^^^^^^^^^^^
Note how the first ^ is prior to the / in front of US
The address expression matches only the line that starts with /bike/latest/ in your example, so the car and boat lines are left untouched. The s: /.*$: substitution then replaces a space followed by a slash followed by any characters up to the end of the line. If DATA_PATH is the same as what is being replaced, this actually changes nothing. Try replacing DATA_PATH with something else and you can see the substitution.
Just to clarify, the substitution replaces everything after a slash that is preceded by a space. There are no spaces before any of the category paths, e.g. /bike/latest/
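The two-step behaviour (select lines by address, then substitute within them) can be mimicked in Python. This is only a sketch of the logic, not a replacement for sed; the function name update_latest is hypothetical, and unlike the shell version it escapes the category so that it is treated as literal text.

```python
import re

def update_latest(lines, category, data_path):
    """Rewrite the second column of the line for the given category."""
    # Address part: only lines that start with /<category>/latest/ qualify.
    address = re.compile("^/" + re.escape(category) + "/latest/")
    # Substitution part: a space, a slash, then everything to end of line.
    target = re.compile(r" /.*$")
    out = []
    for line in lines:
        if address.search(line):
            # A function replacement keeps data_path literal (no escapes).
            line = target.sub(lambda m: " " + data_path, line, count=1)
        out.append(line)
    return out

# Sample rows taken from the question.
rows = [
    "/car/latest/ /US/car/2017/04/02",
    "/bike/latest/ /US/bike/2017/03/31",
    "/boat/latest/ /US/boat/2017/04/03",
]
for row in update_latest(rows, "bike", "/US/bike/2017/04/02"):
    print(row)
```

Only the bike row changes; the boat row never reaches the substitution because the address test fails for it.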

How can I use vim to substitute all whole lines that match a Regex

I have a TeX file which contains several paragraphs like:
\paragraph{name1}
...
\paragraph{name2}
...
Now I want to substitute all the "paragraph" with item, just like:
\item
...
\item
...
To achieve that I have tried many commands and finally I used this
(note that I used "a:" to "z:" as paragraph names):
:% s/\\paragraph[{][a-z]:[}]/\\item/g
and I think that is neither pretty nor efficient. I have tried to match the lines containing "paragraph", but somehow only this word is replaced. Given that I can delete all such lines with
:% g/_*paragraph_*/d
is there a better way to perform a substitution in the same way (that is, to substitute every whole line that contains a specific word)?
Your first attempt was almost correct. Rather than this
:% s/\paragraph[{][a-z]:[}]/\item/g
Use this
:% s/^\\paragraph{[a-z0-9]\+}$/\\item/g
Let's break it down piece by piece:
The ^ character matches the start of the line, so that you don't match something like this:
Some text \paragraph{abc}
The reason why we use \\ instead of \ is because \ is an escape character, so to match it, we escape the escape character.
Doing [a-z0-9]\+ will match one or more a-z or 0-9 characters, which is what I assume your paragraph names are composed of (no | is needed between the ranges; inside brackets it would be matched as a literal pipe character). If you need capital letters, you could do something like [a-zA-Z0-9]\+, and if the names contain a colon, as in your "a:" to "z:", add it to the class as well: [a-z0-9:]\+.
Finally, we anchor the expression to the end of the line with $, so that it does not match lines that don't fit this pattern exactly.
Easy way to do with macro!
First, search for the pattern using /, like /\\paragraph (the backslash is doubled so that it matches a literal backslash).
Let's start the macro. Clear register a by pressing qaq.
Press qa to start recording in register a.
Press n to jump to the next occurrence. Then press c$ to delete to the end of the line and enter insert mode. Then type the replacement text and press the Escape key.
Press @a to repeat the process. End the macro by pressing q.
Now the macro is recorded, and you can press @a once to make the change in all such lines.
You can do this:
:%s/\\paragraph{[^{}]*}/\\item/g
This finds all occurrences of \paragraph{, followed by 0 or more non-{} characters, followed by } (i.e. something like \paragraph{stuff here}), and replaces them by \item.
Or if you want to replace all lines containing paragraph:
:%s/^.*paragraph.*$/\\item/
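The difference between replacing only the command and replacing the whole line can be checked with a short Python sketch. These are Python counterparts of the two Vim substitutions above, not Vim syntax; the sample text is made up, reusing the "Some text \paragraph{abc}" line mentioned earlier.

```python
import re

# Hypothetical sample document.
tex = "\\paragraph{name1}\ntext one\nSome text \\paragraph{abc}"

# Counterpart of :%s/\\paragraph{[^{}]*}/\\item/g
# Replaces just the \paragraph{...} command wherever it occurs.
only_command = re.sub(r"\\paragraph\{[^{}]*\}", lambda m: r"\item", tex)

# Counterpart of :%s/^.*paragraph.*$/\\item/
# Replaces every whole line that contains the word "paragraph".
whole_line = re.sub(r"^.*paragraph.*$", lambda m: r"\item", tex,
                    flags=re.MULTILINE)

print(only_command)
print(whole_line)
```

The third line shows the difference: the first substitution keeps "Some text " in front of \item, while the second discards the whole line.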

How do I add a newline before each non-ascii word in Sublime-Text2?

I have a poorly formatted csv file of Korean words with English definitions. I'd like to add a new line before each Korean word. For example:
# I'd like to change this
하다,to do,크기,size,대기,on hold,
# Into this
하다,to do,
크기,size,
대기,on hold,
Using the regex ([^\x00-\x7F]*) I was able to highlight all instances of Korean words but when I try to replace them with \n$1 it works only for the first word after my last cursor position and then inserts a newline after each character.
Use + instead of * (the former means 1 or more, the latter means 0 or more). Otherwise you get zero-width matches at every position:
[^\x00-\x7F]+
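The effect of the zero-width matches can be reproduced in Python. This is a sketch with Python's re module rather than Sublime Text's engine, so the exact placement of the extra newlines may differ, but the cause is the same.

```python
import re

# Sample row taken from the question.
row = "하다,to do,크기,size,대기,on hold,"

# "*" also matches the empty string, so a match is found at every position.
with_star = re.sub(r"([^\x00-\x7F]*)", r"\n\1", row)

# "+" needs at least one non-ASCII character, so only the Korean words match.
with_plus = re.sub(r"([^\x00-\x7F]+)", r"\n\1", row)

print(with_plus)
print(with_star.count("\n"), "newlines with *,", with_plus.count("\n"), "with +")
```

With + there is exactly one newline per Korean word; with * a newline is also inserted at the positions where nothing was matched.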