help with regex - extracting text - regex

Suppose I have some text files (f1.txt, f2.txt, ...) that look something like this:
@article {paper1,
author = {some author},
title = {some {T}itle} ,
journal = {journal},
volume = {16},
number = {4},
publisher = {John Wiley & Sons, Ltd.},
issn = {some number},
url = {some url},
doi = {some number},
pages = {1},
year = {1997},
}
I want to extract the content of title and store it in a bash variable (call it $title), that is, "some {T}itle" in the example. Notice that there may be nested curly braces inside the outer pair of braces. Also, there might not be whitespace around "=", and there may be extra whitespace before "title".
Thanks so much. I just need a working example of how to extract this and I can extract the other stuff.

Give this a try:
title=$(sed -n '/^[[:blank:]]*title[[:blank:]]*=[[:blank:]]*{/ {s///; s/}[^}]*$//p}' inputfile)
Explanation:
/^[[:blank:]]*title[[:blank:]]*=[[:blank:]]*{/ { - If a line matches this regex
s/// - delete the matched portion (an empty pattern reuses the most recently used regex, here the address)
s/}[^}]*$//p - delete the last closing curly brace and every character that's not a closing curly brace until the end of the line and print
} - end if
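A quick way to check the command is to run it against a throwaway copy of a few lines from the sample entry. This is only a sketch, assuming a POSIX shell with sed; the function name extract_title and the temporary file are made up for illustration.

```shell
# Hypothetical helper: prints the title of a BibTeX-like entry.
# Reads the files given as arguments, or stdin if there are none.
extract_title() {
  sed -n '/^[[:blank:]]*title[[:blank:]]*=[[:blank:]]*{/ {
    s///
    s/}[^}]*$//p
  }' "$@"
}

# Throwaway sample input: no spaces around "=", extra blanks before "title".
sample=$(mktemp)
cat > "$sample" <<'EOF'
author = {some author},
   title={some {T}itle} ,
journal = {journal},
EOF

title=$(extract_title "$sample")
printf '%s\n' "$title"   # prints: some {T}itle
rm -f "$sample"
```

The sed script is spread over several lines only so that it does not depend on how a particular sed parses "p}" on one line.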

title=$(sed -n '/title *=/{s/^[^{]*{\([^,]*\),.*$/\1/;s/} *$//p}' ./f1.txt)
/title *=/: Only act upon lines which have the word 'title' followed by a '=' after an arbitrary number of spaces
s/^[^{]*{\([^,]*\),.*$/\1/: From the beginning of the line look for the first '{' character. From that point save everything you find until you hit a comma ','. Replace the entire line with everything you saved
s/} *$//p: strip off the trailing brace '}' along with any spaces and print the result.
title=$(sed -n ... ): save the result of the above 3 steps in the bash variable named title
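One case worth checking before relying on this version is a title that itself contains a comma. The sketch below wraps this command and the one from the first answer as two helpers (the function names and the sample title are made up) and feeds both the same line. Here the comma-based version prints nothing, because the first substitution cuts the text at the first comma and the second one then finds no closing brace to strip.

```shell
# Command from the first answer: strip from the last "}" to end of line.
title_by_last_brace() {
  sed -n '/^[[:blank:]]*title[[:blank:]]*=[[:blank:]]*{/ {
    s///
    s/}[^}]*$//p
  }' "$@"
}

# Command from this answer: keep everything up to the first comma.
title_up_to_comma() {
  sed -n '/title *=/ {
    s/^[^{]*{\([^,]*\),.*$/\1/
    s/} *$//p
  }' "$@"
}

line='title = {Regex, briefly},'   # made-up title containing a comma
printf 'last brace:  [%s]\n' "$(printf '%s\n' "$line" | title_by_last_brace)"
printf 'up to comma: [%s]\n' "$(printf '%s\n' "$line" | title_up_to_comma)"
```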

There are definitely more elegant ways, but at 2:40AM:
title=$(grep '^\s*title\s*=' test | sed -E 's/^\s*title\s*=\s*\{?//; s/\}?\s*,\s*$//')
Grep for the line that interests us, strip everything up to and including the opening curly, then strip everything from the last curly to the end of the line. The ? quantifier only works in extended regexes, hence sed -E with the braces escaped; \s is a GNU extension.

Related

Vim find and replace everything after '=' until end of line ($)

I have several lines (10 to 20) like these:
title = models.CharField(max_length=250)
...
field20 = models.DateField()
How can I use Vim's find and replace to remove everything from the equal sign until the end of line, so the text turns into:
title
...
field20
I'm familiar with the visual select find and replace (:s/\%Vfoo/bar/g), so I'd like to do something like (:s/\%V< = until $>//g), but this is what I am unsure how to find in Vim:
= until $
I looked at the vimregex wiki, and tried the following with no success:
:s/\%V = (.*?)$//g
Are you looking for this:
:%s/=.*/
?
It's the same as
:%s/=.*//
but the ending / can be omitted. You don't need the g btw because the pattern to the end of the line can match just once.
If you also want to remove the whitespace before the = use:
:%s/[[:blank:]]*=.*/
Note: If you think [[:blank:]] is hard to type, you can use \s instead. Both stand for space or tab.
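Outside Vim, the same substitution can be sketched with sed, which is handy for checking the pattern on sample input. The helper name strip_assignment is made up for this example.

```shell
# Hypothetical helper: drop everything from the first "=" (and any
# blanks in front of it) to the end of each line.
strip_assignment() {
  sed 's/[[:blank:]]*=.*//' "$@"
}

printf '%s\n' \
  'title = models.CharField(max_length=250)' \
  'field20 = models.DateField()' | strip_assignment
# prints:
# title
# field20
```

As in the Vim command, no g flag is needed: the first match already runs to the end of the line, so the later "=" in max_length=250 is never considered separately.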

How to filter a string for invalid filename characters using regex

I don't want the user to be able to type invalid characters, so I try to remove them as they are entered. The regex I wrote removes everything except word characters, which means it also removes . , and -, but I need to keep those to make the user happy :D
In short: this script removes bad characters from an input field using a regex.
Input field:
$CustomerInbox = New-Object System.Windows.Forms.TextBox #initialization -> initializes the input box
$CustomerInbox.Location = New-Object System.Drawing.Size(10,120) #Location -> where the label is located in the window
$CustomerInbox.Size = New-Object System.Drawing.Size(260,20) #Size -> defines the size of the inputbox
$CustomerInbox.MaxLength = 30 #sets max. length of the input box to 30
$CustomerInbox.add_TextChanged($CustomerInbox_OnTextEnter)
$objForm.Controls.Add($CustomerInbox) #adding -> adds the input box to the window
Function:
$ResearchGroupInbox_OnTextEnter = {
if ($ResearchGroupInbox.Text -notmatch '^\w{1,6}$') { # true unless the text is 1 to 6 word characters (letters, digits, underscore)
$ResearchGroupInbox.Text = $ResearchGroupInbox.Text -replace '\W' # remove every non-word character
}
}
Bad Characters I don't want to appear:
~ " # % & * : < > ? / \ { | } #those are the 'bad characters'
Note that if you want to replace invalid file name chars, you could leverage the solution from How to strip illegal characters before trying to save filenames?
Answering your question, if you have specific characters, put them into a character class, do not use a generic \W that also matches a lot more characters.
Use
[~"#%&*:<>?/\\{|}]+
Note that, of all these characters, only \ needs escaping inside a character class. Also, the + quantifier (one or more occurrences of the quantified subpattern) streamlines the replacement: each run of consecutive bad characters is matched as a whole and replaced at once with the replacement pattern (here, an empty string).
Note you may also need to account for filenames like con, lpt1, etc.
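For quick experiments outside PowerShell, the same "list exactly the characters you want gone" idea can be sketched in a shell with tr. This swaps the regex character class for a plain tr character set, so it is an analogue rather than the -replace call itself; the helper name strip_bad_chars is made up.

```shell
# Hypothetical helper: delete the characters listed in the question.
# tr takes a plain character set, not a regex; "\\" is how tr spells
# a single backslash.
strip_bad_chars() {
  tr -d '~"#%&*:<>?/\\{|}'
}

printf '%s\n' 'my<file>:na|me?.v1,final-2.txt' | strip_bad_chars
# prints: myfilename.v1,final-2.txt
```

The dot, comma and hyphen survive because they are simply not in the set, which is the point of preferring an explicit list over \W.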
To ensure the filename is valid, you should use the GetInvalidFileNameChars .NET method to retrieve all invalid characters and use a regex to check whether the filename is valid:
[regex]$containsInvalidCharacter = '[{0}]' -f ([regex]::Escape([System.IO.Path]::GetInvalidFileNameChars()))
if ($containsInvalidCharacter.IsMatch(($ResearchGroupInbox.Text)))
{
# filename is invalid...
}
$ResearchGroupInbox.Text -replace '~|"|#|%|\&|\*|:|<|>|\?|\/|\\|{|\||}'
Or, as @Wiketor suggests, you can shorten it to '[~"#%&*:<>?/\\{|}]+'

Powershell capture group - insert some text after the matching line

I am trying to insert a line with powershell statement into multiple script files.
The content of the files are like these (3 cases)
"param($installPath)"
- no CRLF characters, only 1st line
"param($installPath)`r`n`"
- with CRLF characters, no 2nd line
"param($installPath)`r`n`some text on the second line"
- with CRLF characters, non-empty 2nd line
I want to insert some text (a PowerShell statement, `r`n$myvar = somethingelse) immediately below the first line, appending the `r`n characters to the first line if they don't exist yet:
"param($installPath)`r`nmyvar = somethingelse"
- ADD CRLF character first to the 1st line and ADD $myvar = somethingelse on the 2nd line
"param($installPath)`r`n`$myvar = somethingelse"
- ONLY ADD "$myvar = somethingelse" on the 2nd line, since CRLF already exists (no need to add the ending `r`n)
"param($installPath)`r`n`$myvar = somethingelse`r`n`some text on the second line"
- ADD "$myvar = somethingelse`r`n" on the 2nd line (CRLF already exists on the first line) and APPEND CRLF to it so the existing text on the second line will move to the 3rd line.
I was trying to use this regular expression:
"^param(.+)(?:(`r`n))"
and this replacement, but with no success ($1 is the first capture group; the second group is the non-capturing one, which I ignore even if something is found, and I explicitly add CRLF after the $1 capture group):
"$1`r`nmyvar = somethingelse"
Thanks,
Rad
The following use of -replace seems to match your requirements:
$content = "param(some/path)"
#$content = "param(some/path)`r`n"
#$content = "param(some/path)`r`n`some text on the second line"
$content = $content -replace "^(param\(.+\))(?:\r\n$)?",
( '$1' + "`r`nmyvar = somethingelse" )
Write-Host "`n$content"
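Note that references to capture groups such as $1 have to be in single quotes; in a double-quoted string PowerShell would expand $1 as a variable before the regex engine ever sees it.
The optional, non-capturing (?:\r\n$) group consumes a trailing CRLF only when nothing follows it (the $ anchors the end of the string), so the CRLF added by the replacement does not end up doubled.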
Edit
If what follows param is not known, the following regex could be used instead.
It uses [^\r\n] to capture characters that are not newlines.
"^(param[^\r\n]*)(?:\r\n$)?"

Find and Replace first character after a certain pattern

Current text
Variable length text = some string(some more text
Change to
Variable length text = some string(addition some more text
I need to add a certain text after the first parenthesis in a line, but only after an "=" character has been encountered. Another condition is to ignore patterns like "= (", that is, cases with only whitespace between "=" and "(".
My try:
sed -e "s#*=(\w\()#\1(addition#g"
Thanks in anticipation!
Tweak this for your needs:
$ echo 'Variable length text = some string(some more text' |\
sed 's/^[^=]*=[^(]*[[:alnum:]][^(]*(/&addition /'
That matches for:
Beginning of the string
Anything but = any number of times
=
Anything but ( any number of times
An alpha-numeric character
Anything but ( any number of times
(
... and substitutes it with the matched string (the & in the replacement), adding 'addition ' after it.
The output is
Variable length text = some string(addition some more text
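To see both conditions from the question at once, the command can be run on one line that should change and one that should not. A small sketch, with a made-up helper name and a made-up second input line:

```shell
# Hypothetical helper wrapping the substitution above. "&" in the
# replacement stands for the whole matched text.
add_after_paren() {
  sed 's/^[^=]*=[^(]*[[:alnum:]][^(]*(/&addition /' "$@"
}

printf '%s\n' \
  'Variable length text = some string(some more text' \
  'Variable length text = (some more text' | add_after_paren
# prints:
# Variable length text = some string(addition some more text
# Variable length text = (some more text
```

The second line is left alone because there is no alphanumeric character between "=" and the first "(", so the pattern cannot match.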
In Perl:
s/(.*?=[\s][^(]+?)\((.*)/$1(additional text $2/

Is it possible to conditionally insert text via regex substitution in Vim?

I have several lines from a table that I'm converting from Excel to the Wiki format, and want to add link tags for part of the text on each line, if there is text in that field. I have started the conversion and have come to this point:
|10.20.30.9||x|-||
|10.20.30.10||x|s04|Server 4|
|10.20.30.11||x|s05|Server 5|
|10.20.30.12|||||
|10.20.30.13|||||
What I want is to change the fourth column from, e.g., s04 to [[server:s04]]. I do not wish to add the link brackets if the line is empty, or if it contains -. If that - is a big problem, I can remove it.
All my regex attempts to pick out part of the line end with the whole line being replaced.
Consider using awk to do this:
#!/bin/bash
awk -F'|' '
{
OFS = "|";
if ($5 != "" && $5 != "-")
$5 = "server:" $5;
print $0
}'
NOTE: I've edited this script since the first version. This current one, IMO is better.
Then you can process it with:
cat $FILENAME | sh $AWK_SCRIPTNAME
The -F'|' switch tells awk to use | as the field separator. The if statement and the print are fairly self-explanatory: every record is printed, with 'server:' prepended to column 5 only if that column is neither "-" nor "".
Why column 5 and not column 4?: Because you use | at the beginning of each record. So awk takes the 'first' field ($1) to be an empty string that it believes should have occurred before this first |.
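The script above prepends only server:, while the question asks for [[server:s04]]. A minimal variation that adds the brackets might look like this; the wrapper name link_servers is made up, and the field handling is otherwise the same idea.

```shell
# Hypothetical helper: wrap a non-empty, non-"-" fourth wiki column
# (awk field 5, because each row starts with "|") in [[server:...]].
link_servers() {
  awk 'BEGIN { FS = OFS = "|" }
       $5 != "" && $5 != "-" { $5 = "[[server:" $5 "]]" }
       { print }' "$@"
}

printf '%s\n' \
  '|10.20.30.9||x|-||' \
  '|10.20.30.10||x|s04|Server 4|' \
  '|10.20.30.12|||||' | link_servers
# prints:
# |10.20.30.9||x|-||
# |10.20.30.10||x|[[server:s04]]|Server 4|
# |10.20.30.12|||||
```

Setting OFS in a BEGIN block matters only for rows where $5 is assigned, since awk rebuilds the record with OFS at that point; untouched rows are printed exactly as read.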
This seems to do the job on the sample you give up there (with Vim):
%s/^|\%([^|]*|\)\{3}\zs[^|]*/\=(empty(submatch(0)) || submatch(0) == '-') ? submatch(0) : '[[server:'.submatch(0).']]'/
It's probably better to use awk as ArjunShankar writes, but this should work if you remove "-" ;) Didn't get it to work with it there.
:%s/^\([^|]*|\)\([^|]*|\)\([^|]*|\)\([^|]*|\)\([^|]\+\)|/\1\2\3\4[[server:\5]]|/
It's just stupid though. The first 4 groups are identical (match anything up to | 4 times). Didn't get it to work with {4} (Vim spells the count \{4}). The fifth matches the s04/s05 strings; the closing | is kept outside the group so that it does not end up inside the link. It just requires that the column is not empty, therefore "-" must be removed.
Adding a bit more readability to the ideas given by others:
:%s/\v^%(\|.{-}){3}\|\zs(\w+)/[[server:\1]]/
Job done.
Note how {3} indicates the number of columns to skip. Also note the use of \v for very magic regex mode. This reduces the complexity of your regex, especially when it uses more 'special' characters than literal text.
Let me recommend the following substitution command.
:%s/^|\%([^|]*|\)\{3}\zs[^|-]\+\ze|/[[server:&]]/
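Outside Vim, a comparable substitution can be sketched with sed. sed has no \zs or \ze, so in this version the leading columns are captured and written back instead; the helper name link_fourth_column is made up for the example.

```shell
# Hypothetical helper. \1 is the leading "|" plus the first three
# columns, \2 is the inner repeated group, \3 is the fourth column
# (non-empty, containing neither "|" nor "-").
link_fourth_column() {
  sed 's/^\(|\([^|]*|\)\{3\}\)\([^|-]\{1,\}\)|/\1[[server:\3]]|/' "$@"
}

printf '%s\n' \
  '|10.20.30.9||x|-||' \
  '|10.20.30.10||x|s04|Server 4|' \
  '|10.20.30.12|||||' | link_fourth_column
# prints:
# |10.20.30.9||x|-||
# |10.20.30.10||x|[[server:s04]]|Server 4|
# |10.20.30.12|||||
```

Using [^|]* for each column, rather than a lazy "any character" match, keeps every group confined to a single column.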
Try:
:1,$s/|\(s[0-9]\+\)|/|[[server:\1]]|/
assuming that your s04, s05 values are always an s followed by a number.
A simpler substitution can be achieved with this:
%s/^|.\{-}|.\{-}|.\{-}|\zs\(\w\{1,}\)\ze|/[[server:\1]]/
^|.\{-}|.\{-}|.\{-}| -> matches the first 3 columns (empty or not);
\zs -> marks the "start of match";
\(\w\{1,}\) -> matches only if the 4th column consists of letters, numbers and `_` ([0-9A-Za-z_]);
\ze -> marks the "end of match".
If the _ character, like -, can appear but must not be substituted, use the following regex instead: %s/^|.\{-}|.\{-}|.\{-}|\zs\([0-9a-zA-Z]\{1,}\)\ze|/[[server:\1]]/