I have been searching for this in forums and on stackoverflow; it must be here somewhere but I couldn't find it.
I'm on a Mac, using the terminal to run a shell script to rename some pdf files based on file content.
I have a directory full of PDFs that I'm exporting to text files using the open-source PDFBox. The resulting files have the same name as the PDF file but end in .txt. I created the text files so that I could find a string inside each file with the format Page xx Question xx, for example Page 43 Question 2. Given this example, I would like to rename the PDF file to pg43_q2.pdf
I think the regular expression I want is this:
Page\s+(\d+)\s+Question\s+(\d+)
but I'm not sure how to read the two captured numbers and save them into a string that I can use as a filename.
The script I have so far is:
#!/bin/sh
PDF_FILE_PATH=$1
echo "Converting pdfs at $PDF_FILE_PATH"
find "$PDF_FILE_PATH" -name '*.pdf' -print0 | while IFS= read -r -d '' filename; do
echo $filename
java -jar pdfbox-app-1.6.0.jar ExtractText "$filename" "$filename.txt"
NEWNAME=$(sed -n -e '/Page/s/Page\s+\(\d+\)\s+Question\s+\(\d+\).*$/pg\1_q\2/p' "$filename.txt")
echo "Renaming pdf $filename to $NEWNAME"
# I would do this next but the $NEWNAME is empty
# mv "filename" "PDF_FILE_PATH$NEWNAME"
done
... but the sed command is not putting anything into the NEWNAME variable.
I'm not particularly attached to sed; any suggestions would be appreciated.
Latest edit to script uses the following sed command:
newname=$(sed -nE -e '/Page/s/^.*Page[[:blank:]]+([0-9]+)[[:blank:]]+Question[[:blank:]]+([0-9]+).*$/pg\1_q\2.pdf/p' "$filename.txt")
This works about 50% of the time, but the rest of the time the newname variable is empty when I go to rename the file.
The third line of a converted file that does work:
Unit 2 Review Page 257 Question 9 a) 12 (2)(2)(3)
The third line of a converted file that doesn't work:
Unit 2 Review Page 258 Question 16 a) (a – 4)(a + 7) = a(a + 7) – 4(a + 7) = a2 + 7a – 4a – 28 = a2 + 3a – 28 b) (2x + 3)(5x + 2) = 2x(5x + 2) + 3(5x + 2) = 10x2 + 4x + 15x + 6 = 10x2 + 19x + 6 c) (–x + 5)(x + 5) = –x(x + 5) + 5(x + 5) = –x2 – 5x + 5x + 25 = –x2 + 25 d) (3y + 4)2 = (3y + 4)(3y + 4) = 3y(3y + 4) + 4(3y + 4) = 9y2 + 12y + 12y + 16 = 9y2 + 24y + 16 e) (a – 3b)(4a – b) = a(4a – b) – 3b(4a – b) = 4a2 – ab – 12ab + 3b2 = 4a2 – 13ab + 3b2 f) (v – 1)(2v2 – 4v – 9) = v(2v2 – 4v – 9) – 1(2v2 – 4v – 9) = 2v3 – 4v2 – 9v – 2v2 + 4v + 9 = 2v3 – 6v2 – 5v + 9
Removed unhelpful original answer
echo 'Unit 2 Review Page 257 Question 9 a) 12 (2)(2)(3)'\
| sed -n '/Page/{s/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$/pg\1_q\2/;p;q;}'
output
pg257_q9
echo 'Unit 2 Review Page 258 Question 16 a) (a 4)(a + 7) = a(a + 7) 4(a + 7)'\
| sed -n '/Page/{s/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$/pg\1_q\2/;p;q;}'
output
pg258_q16
Otherwise, you had it right!
(Note that the sed processing is the same for both cases).
I've included an initial { and a trailing ;p;q;} so that the sed script processes only the first line containing 'Page' and then quits.
I've expanded the POSIX character classes to their basic forms, i.e. [[:digit:]] = [0-9], and replaced each + with a repetition of the initial character class followed by the 'zero-or-more' character '*', giving [0-9][0-9]*. My personal experience, having learned sed on a Sun 3 from O'Reilly's 2nd edition Sed and Awk (with the comb-binding!), is that all the POSIX stuff is a distraction and a further source of errors. I'm clearly in the minority on this here on S.O ;-), but I'm willing to admit that newer seds have some great features and in any case .....
I hope this helps.
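For comparison, the same capture can be sketched in Python, which sidesteps the differences between sed dialects. This is only an illustration and not part of the answer above: the helper name `new_name` is made up, and it assumes the extracted text contains a `Page <n> Question <n>` string somewhere.

```python
import re

# Hypothetical helper (not from the original thread): build
# "pg<page>_q<question>.pdf" from the text extracted from a PDF.
PATTERN = re.compile(r"Page\s+(\d+)\s+Question\s+(\d+)")

def new_name(text):
    """Return the new file name, or None if no 'Page N Question N' is found."""
    match = PATTERN.search(text)
    if match is None:
        return None
    page, question = match.groups()
    return f"pg{page}_q{question}.pdf"

print(new_name("Unit 2 Review Page 257 Question 9 a) 12 (2)(2)(3)"))
print(new_name("Unit 2 Review Page 258 Question 16 a) (a – 4)(a + 7)"))
```

Returning None when nothing matches makes it easy to skip the rename instead of moving a file to an empty name.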
I'm sure I'm way off here and there's probably an easier way to do this, but I have to change some cron schedules stored as text and change the hour at which they run.
The cron expression itself is in a text file, so I was thinking of using PowerShell to get the content, update it and save it, but I'm not very good with regex. I've cobbled something together from internet sources, but something is still slightly missing.
I can do all the reading and writing of the file content; it's just the slightly more advanced replacement of the second field that's an issue.
Here's my code:
$testText = "cron(0 20 * * ? *)"
$pattern = '(\().+?(\))'
$new1 = [regex]::Matches($testText, $pattern).Value
"new1 $new1"
$new2 = $($new1 -replace '\s+', ' ').split()
"new2 $new2"
$new3 = $new2[1]-1 #Set back an hour
#$new3 = $new2[1] -replace $new2[1],"22" #Specify Hour
"new3 $new3"
$new4 = $testText -Replace($new2,$new3)
"new4 $new4"
This results in this output - where annoyingly the first bit is knocked off..
new1 (0 20 * * ? *)
new2 (0 20 * * ? *)
new3 19
new4 cron(19* * ? *)
I've split this on spaces because the first item after the opening parenthesis can be 1 or 2 digits, and the fields after the second item (the hour) can also vary, e.g. cron(30 06 ? * MON-FRI *)
Any help would be appreciated... - it's been a long day & my brain is very tired!!
$testText = "cron(0 20 * * ? *)"
$cron = $testText.split(' ').split('(').split(')')
'cron(' +
$cron[1] + ' ' +
(([datetime]::Today).AddHours($cron[2] -1).Hour) + ' ' +
$cron[3] + ' ' +
$cron[4] + ' ' +
$cron[5] + ' ' +
$cron[6] +
')'
or perhaps a little easier
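$testText = "cron(0 20 * * ? *)"
$cron = $testText.split(' ').split('(').split(')')
# $cron is now: cron, 0, 20, *, *, ?, *, and a trailing empty string
$cron[2] = ([datetime]::Today).AddHours($cron[2]-1).Hour
# join only the six schedule fields; index 0 is the literal "cron"
'cron(' + ($cron[1..6] -join ' ') + ')'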
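The same idea (isolate the hour field, shift it, reassemble the string) can be sketched in Python for comparison. The function name `shift_cron_hour` is hypothetical, and the sketch assumes the hour is always a plain number in the second field, as in the examples from the question. Wrapping with modulo 24 plays the role of the `[datetime]` trick above.

```python
import re

def shift_cron_hour(expr, delta):
    """Shift the hour (second field) of 'cron(min hour ...)' by delta hours."""
    match = re.fullmatch(r"cron\((\S+)\s+(\d+)\s+(.*)\)", expr)
    if match is None:
        raise ValueError(f"not a cron(...) expression: {expr!r}")
    minute, hour, rest = match.groups()
    # Wrap around midnight, so hour 0 shifted by -1 becomes 23.
    new_hour = (int(hour) + delta) % 24
    return f"cron({minute} {new_hour} {rest})"

print(shift_cron_hour("cron(0 20 * * ? *)", -1))
print(shift_cron_hour("cron(30 06 ? * MON-FRI *)", -1))
```

Note that the hour is written back without zero padding, so `06` becomes `5` rather than `05`.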
I have a huge text file in the following pattern
####
Some Question 1
answer 1
####
####
Some Question 2
answer 2
some answer 2
another answer 2
####
####
Some Question 3
answer 3
some answer 3
####
In my project I need to:
1. find the lines between two #### markers, which I already did with (####)(.+?)(####)
2. put a question mark at the end of the first line after ####
3. put a slash before the second and third answer lines
to get a result like this:
Some Question 1 ? answer 1
Some Question 2 ? answer 2 / some answer 2 / another answer 2
Some Question 3 ? answer 3 / some answer 3
As I mentioned, I already matched the text with three groups: \1 and \3 are the #### markers and \2 is the lines in between. How can I separate those lines and make the desired changes?
I recommend doing this job outside of notepad, using a script launched from the command line.
If you have awk installed on your system, write the following script, say script.awk:
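#!/usr/bin/awk -f

# Delimiter line: print the finished block (if any) and reset the state.
/^####$/ {
    if (q != "") {
        print q a
    }
    q = ""
    a = ""
    next
}

# Other lines: the first one is the question, the rest are answers.
{
    if (q == "") {
        q = $0 " ? "
    } else if (a == "") {
        a = $0
    } else {
        a = a " / " $0
    }
}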
Assuming your input is in file input.txt, you can run this script from the command line issuing:
./script.awk input.txt
or:
awk -f script.awk input.txt
I assume you can work in a Unix-like environment.
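If awk is not available, the same small state machine can be sketched in Python. This is an assumption-laden illustration rather than a drop-in replacement: the function name `merge_blocks` is made up, and unlike the awk script it strips surrounding whitespace and ignores blank lines.

```python
def merge_blocks(lines):
    """Turn '####'-delimited blocks into 'question ? answer / answer' lines."""
    results = []
    block = []  # lines collected since the last '####' marker
    for raw in lines:
        line = raw.strip()
        if line == "####":
            if block:
                question, answers = block[0], block[1:]
                results.append(f"{question} ? " + " / ".join(answers))
            block = []
        elif line:
            block.append(line)
    return results

sample = """####
Some Question 1
answer 1
####
####
Some Question 2
answer 2
some answer 2
another answer 2
####"""

for merged in merge_blocks(sample.splitlines()):
    print(merged)
```

For a real file, the list passed in could come from `open("input.txt", encoding="utf-8")`, since iterating over a file object yields its lines.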
I have a string representing a command line where a binary file and a series of arguments are given.
string = "./bin -m A 4 -n 12 --LongName1 12 --LongName2 45 -t Hello -l 0.002 "
I'd like to extract the numerical value that is associated to --LongName1. How can I do that? Note that --LongName2 does not necessarily follow LongName1. Anything could follow LongName1 including the end of the string.
I found a solution and it seems to work fine but it is really ugly:
re = regexpr("LongName1", string)
start = attr(re, "match.length") + re[1] + 1
nbdigits = which(is.na(sapply(strsplit(substr(string, start, nchar(string)), ""), as.numeric)))[1] - 1
as.numeric(substr(string, start, start + nbdigits - 1))
# 12
Use a regex with look-behind:
string = "./bin -m A 4 -n 12 --LongName1 12 --LongName2 45 -t Hello -l 0.002 "
pattern <- "(?<=--LongName1 )\\d*"
m <- regexpr(pattern, string, perl = TRUE)
regmatches(string, m)
#[1] "12"
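For readers outside R, here is a rough Python equivalent of the same idea. Instead of a look-behind it uses a capture group after the option name. The helper name `option_value` is hypothetical, and the sketch assumes the value is a plain integer or decimal separated from the option by whitespace.

```python
import re

def option_value(command_line, name):
    """Return the numeric value following option `name` as a string, or None."""
    pattern = rf"{re.escape(name)}\s+(-?\d+(?:\.\d+)?)(?=\s|$)"
    match = re.search(pattern, command_line)
    return match.group(1) if match else None

string = "./bin -m A 4 -n 12 --LongName1 12 --LongName2 45 -t Hello -l 0.002 "
print(option_value(string, "--LongName1"))
print(option_value(string, "--LongName2"))
print(option_value(string, "--Missing"))
```

The trailing `(?=\s|$)` makes sure the whole token is numeric, so the option may also be the last thing in the string.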
I want to capture all certain occurrences in a string in Vimscript.
example:
let my_calculation = '200/3 + 23 + 100.5/3 -2 + 4*(200/2)'
How can I capture all numbers (including the decimal point, if any) before and after each '/' in two different variables?
- output before_slash: 200100.5200
- output after slash 332
How can I replace them if a condition holds?
E.g. if there is no '.' in the number after a single '/', add '.0' after this number.
I tried to use matchstr() and regex but after trying and trying I couldn't resolve it.
A useful feature that can be taken advantage of in this case is substitution
with an expression (see :help sub-replace-\=).
" a collects the numbers before each '/', b the numbers after it
let [a; b] = [[]]
call substitute(my_calculation, '\(\d*\.\?\d\+\)/\(\d*\.\?\d\+\)\zs',
\ '\=add(a,submatch(1))[1:0]+add(b,submatch(2))[1:0]', 'g')
To answer the second part of the question:
let my_calculation = '200/3 + 23 + 100.5/3 -2 + 4*(200/2)'
echo substitute(my_calculation, '\(\/[0-9]\+\)\([^0-9.]\|$\)', '\1.0\2', 'g')
The above outputs:
200/3.0 + 23 + 100.5/3.0 -2 + 4*(200/2.0)
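To make the logic of that substitution easier to follow, here is a sketch of the same rule in Python rather than Vimscript. The function name `add_decimal` is made up. A negative look-ahead stands in for the `[^0-9.]\|$` alternative used above.

```python
import re

def add_decimal(expr):
    """Append '.0' to every integer that directly follows a '/'."""
    # (?![\d.]) rejects matches that stop in the middle of a number or
    # right before a decimal point, so '/1.5' and '/35.5' are left alone.
    return re.sub(r"/(\d+)(?![\d.])", r"/\1.0", expr)

my_calculation = "200/3 + 23 + 100.5/3 -2 + 4*(200/2)"
print(add_decimal(my_calculation))
```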
Give this a try:
function! GetNumbers(string)
let pairs = filter(split(a:string, '[^0-9/.]\+'), 'v:val =~ "/"')
let den = join(map(copy(pairs), 'matchstr(v:val, ''/\zs\d\+\(\.\d\+\)\?'')'), '')
let num = join(map(pairs, 'matchstr(v:val, ''\d\+\(\.\d\+\)\?\ze/'')'), '')
return [num, den]
endfunction
let my_calculation = '200/3 + 23 + 100.5/3 -2 + 4*(200/2)'
let [a,b] = GetNumbers(my_calculation)
echo a
echo b
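The pairing logic can also be sketched in Python, which may help when checking what the Vimscript function is expected to return. The name `split_fractions` is hypothetical, and the sketch assumes numbers are written as digits with an optional decimal part.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

def split_fractions(expr):
    """Return (numerators, denominators), each concatenated into one string."""
    pairs = re.findall(rf"({NUMBER})/({NUMBER})", expr)
    before = "".join(num for num, _ in pairs)
    after = "".join(den for _, den in pairs)
    return before, after

my_calculation = "200/3 + 23 + 100.5/3 -2 + 4*(200/2)"
before_slash, after_slash = split_fractions(my_calculation)
print(before_slash)
print(after_slash)
```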
I would like to visually select a calculation backwards, e.g.
200 + 3 This is my text -300 +2 + (9*3)
|-------------|*
This is text 0,25 + 2.000 + sqrt(15/1.5)
|-------------------------|*
The reason is that I will use it in insert mode.
After writing a calculation I want to select the calculation (using a map) and put the results of the calculation in the text.
What the regex must do is:
- select from the cursor (see * in above example) backwards to the start of the calculation
(including \/-+*:.,^).
- the calculation can start only with log/sqrt/abs/round/ceil/floor/sin/cos/tan or with a positive or negative number
- the calculation can also start at the beginning of the line but it never goes back to
a previous line
I tried in all ways but could not find the correct regex.
I noticed that backward searching is different from forward searching.
Can someone help me?
Edit
Forgot to mention that it must also include the '=' if there is one, provided the '=' is directly before the cursor or there are only spaces between the '=' and the cursor.
It must not include other '=' signs.
200 + 3 = 203 -300 +2 + (9*3) =
|-------------------|<SPACES>*
200 + 3 = 203 -300 +2 + (9*3)
|-----------------|<SPACES>*
* = where the cursor is
A regex that comes close in pure vim is
\v\c\s*\zs(\s{-}(((sqrt|log|sin|cos|tan|exp)?\(.{-}\))|(-?[0-9,.]+(e-?[0-9]+)?)|([-+*/%^]+)))+(\s*\=?)?\s*
There are limitations: subexpressions (including function arguments) aren't parsed. You'd need a proper grammar parser to do that, and I don't recommend doing that in pure Vim (see footnote 1).
Operator Mapping
To enable using this a bit like text-objects, use something like this in your $MYVIMRC:
func! DetectExpr(flag)
let regex = '\v\c\s*\zs(\s{-}(((sqrt|log|sin|cos|tan|exp)?\(.{-}\))|(-?[0-9,.]+(e-?[0-9]+)?)|([-+*/%^]+)))+(\s*\=?)?\s*'
return searchpos(regex, a:flag . 'ncW', line('.'))
endf
func! PositionLessThanEqual(a, b)
"echo 'a: ' . string(a:a)
"echo 'b: ' . string(a:b)
if (a:a[0] == a:b[0])
return (a:a[1] <= a:b[1]) ? 1 : 0
else
return (a:a[0] <= a:b[0]) ? 1 : 0
endif
endf
func! SelectExpr(mustthrow)
let cpos = getpos(".")
let cpos = [cpos[1], cpos[2]] " use only [lnum,col] elements
let begin = DetectExpr('b')
if ( ((begin[0] == 0) && (begin[1] == 0))
\ || !PositionLessThanEqual(begin, cpos) )
if (a:mustthrow)
throw "Cursor not inside a valid expression"
else
"echoerr "not satisfied: " . string(begin) . " < " . string(cpos)
endif
return 0
endif
"echo "satisfied: " . string(begin) . " < " . string(cpos)
call setpos('.', [0, begin[0], begin[1], 0])
let end = DetectExpr('e')
if ( ((end[0] == 0) || (end[1] == 0))
\ || !PositionLessThanEqual(cpos, end) )
call setpos('.', [0, cpos[0], cpos[1], 0])
if (a:mustthrow)
throw "Cursor not inside a valid expression"
else
"echoerr "not satisfied: " . string(begin) . " < " . string(cpos) . " < " . string(end)
endif
return 0
endif
"echo "satisfied: " . string(begin) . " < " . string(cpos) . " < " . string(end)
norm! v
call setpos('.', [0, end[0], end[1], 0])
return 1
endf
silent! unmap X
silent! unmap <M-.>
xnoremap <silent>X :<C-u>call SelectExpr(0)<CR>
onoremap <silent>X :<C-u>call SelectExpr(0)<CR>
Now you can operate on the nearest expression around (or after) the cursor position:
vX - [v]isually select e[X]pression
dX - [d]elete current e[X]pression
yX - [y]ank current e[X]pression
"ayX - id. to register a
As a trick, the key sequence in the "For Fun: ascii art" section further down reproduces the exact ASCII art from the OP (using virtualedit for the purpose of the demo).
Insert mode mapping
In response to the chat:
" if you want trailing spaces/equal sign to be eaten:
imap <M-.> <C-o>:let @e=""<CR><C-o>"edX<C-r>=substitute(@e, '^\v(.{-})(\s*\=?)?\s*$', '\=string(eval(submatch(1)))', '')<CR>
" but I'm assuming you wanted them preserved:
imap <M-.> <C-o>:let @e=""<CR><C-o>"edX<C-r>=substitute(@e, '^\v(.{-})(\s*\=?\s*)?$', '\=string(eval(submatch(1))) . submatch(2)', '')<CR>
This allows you to hit Alt-. during insert mode, and the current expression gets replaced with its evaluation. The cursor ends up at the end of the result in insert mode.
200 + 3 This is my text -300 +2 + (9*3)
This is text 0.25 + 2.000 + sqrt(15/1.5)
Tested by pressing Alt-. in insert 3 times:
203 This is my text -271
This is text 5.412278
For Fun: ascii art
vXoyoEsc`<jPvXr-r|e.
To easily test it yourself:
:let @q="vXoyo\x1b`<jPvXr-r|e.a*\x1b"
:set virtualedit=all
Now you can run @q anywhere and it will ASCII-decorate the nearest expression :)
200 + 3 = 203 -300 +2 + (9*3) =
|-------|*
|-------------------|*
200 + 3 = 203 -300 +2 + (9*3)
|-----------------|*
|-------|*
This is text 0,25 + 2.000 + sqrt(15/1.5)
|-------------------------|*
Footnote 1: consider using Vim's Python integration to do such parsing.
This seems quite a complicated task after all to achieve with regex, so if you can avoid it in any way, try to do so.
I've created a regex that works for a few examples - give it a try and see if it does the trick:
^(?:[A-Za-z]|\s)+((?:[^A-Za-z]+)?(?:log|sqrt|abs|round|ceil|floor|sin|cos|tan)[^A-Za-z]+)(?:[A-Za-z]|\s)*$
The part that you are interested in should be in the first matching group.
Let me know if you need an explanation.
EDIT:
^ - match the beginning of a line
(?:[A-Za-z]|\s)+ - match everything that's a letter or a space once or more
match and capture the following 3:
((?:[^A-Za-z]+)? - match everything that's NOT a letter (i.e. in your case numbers or operators)
(?:log|sqrt|abs|round|ceil|floor|sin|cos|tan) - match one of your keywords
[^A-Za-z]+) - match everything that's NOT a letter (i.e. in your case numbers or operators)
(?:[A-Za-z]|\s)* - match everything that's a letter or a space zero or more times
$ - match the end of the line
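To try the pattern outside the editor, it can be run through Python's `re` module, as sketched below. The helper name `trailing_calculation` is made up. Because the pattern requires one of the listed keywords, and requires the line to start with a letter or a space, the first example from the question does not match at all, which illustrates how limited this approach is.

```python
import re

# The regex from the answer above, unchanged.
PATTERN = re.compile(
    r"^(?:[A-Za-z]|\s)+"
    r"((?:[^A-Za-z]+)?(?:log|sqrt|abs|round|ceil|floor|sin|cos|tan)[^A-Za-z]+)"
    r"(?:[A-Za-z]|\s)*$"
)

def trailing_calculation(line):
    """Return the captured calculation (group 1), or None if nothing matches."""
    match = PATTERN.match(line)
    return match.group(1) if match else None

print(trailing_calculation("This is text 0,25 + 2.000 + sqrt(15/1.5)"))
print(trailing_calculation("200 + 3 This is my text -300 +2 + (9*3)"))
```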