Filename extraction with regex


I need to extract only the filename (info.txt) from a line like:
07/01/2010 07:25p 953 info.txt
I've tried using /d+\s+\d+\s+\d+\s+(?.?)/, but it doesn't seem to work ...

How about
/\S+$/
I.e. the longest possible string of non-whitespace at the end of the line.
(Hard to know for sure without more info about the possible inputs.)
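For anyone who wants to try the \S+$ idea outside a regex tester, here is a minimal Python sketch. The helper name last_token is made up for illustration, and the only input assumed is the sample line from the question.

```python
import re
from typing import Optional

# Hypothetical helper: return the last run of non-whitespace on the line,
# or None if the line ends in whitespace or is empty.
def last_token(line: str) -> Optional[str]:
    match = re.search(r"\S+$", line.rstrip("\r\n"))
    return match.group(0) if match else None

print(last_token("07/01/2010 07:25p 953 info.txt"))
```

A name containing a space comes back truncated to its last word, which is the limitation discussed in this answer.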
As @J V pointed out, filenames with spaces in them (like his username) will not be parsed properly by the above regexp. We don't know from the question whether that's possible.
But I have a suspicion that we're looking at the output of the Windows DIR command, or something very similar. In that case, the most reliable approach might be just to hack off the first 39 characters and keep the rest:
/^.{39}(.+)$/
Then $1 will contain the filename.
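The same fixed-width idea can be sketched in Python with a plain slice instead of a regex. This assumes every listing line really has a 39-character prefix; the sample line below is made up and padded to that width, so check the width against your own output first.

```python
# Assumption: date, time and size columns always occupy the first 39 characters.
PREFIX_WIDTH = 39

def name_after_prefix(line, width=PREFIX_WIDTH):
    """Return everything after the fixed-width prefix, or None for short lines."""
    line = line.rstrip("\r\n")
    return line[width:] if len(line) > width else None

# Made-up sample: pad the date/time and size columns out to exactly 39 characters.
sample = "07/01/2010  07:25p".ljust(24) + "953".rjust(14) + " " + "my info.txt"
print(name_after_prefix(sample))
```

Because nothing is matched on whitespace, names with spaces survive intact.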
Better option:
But if you are using Windows DIR (as per your new comment), and you can control the DIR command, try
DIR /b
which removes the unneeded cruft (assuming you don't need the date, size etc.) and gives you one filename per line.
OK, you're using a Unix dir (per newer comment). The CentOS dir I have outputs one file per line, nothing else, when you give it no command line options. Chances are very good that whichever dir you're using can be persuaded to output filenames like that... then you wouldn't have to worry about using a regex that may or may not be correct for every possible input. Try man dir or dir --help to find out what command-line options to use.

\d\d:\d\d\w\s+\d+\s+(.*?)$
$1 will be the file name
The problem with your original regex is that it doesn't account for the : and / characters in the date and time, and (?.?) is not a meaningful group...
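Here is a quick way to check that pattern in Python. It is only a sketch: the function name extract_name is hypothetical, and the second test line (a name with a space) is an invented input, not something from the question.

```python
import re

# Pattern from the answer: hh:mm plus one am/pm letter, the size, then the name.
LISTING_RE = re.compile(r"\d\d:\d\d\w\s+\d+\s+(.*?)$")

def extract_name(line):
    """Return the captured filename, or None if the line does not match."""
    match = LISTING_RE.search(line.rstrip("\r\n"))
    return match.group(1) if match else None

for line in ("07/01/2010 07:25p 953 info.txt",
             "07/01/2010 07:25p 953 my info.txt"):
    print(extract_name(line))
```

Since the pattern anchors on the time and size fields rather than on whitespace at the end, it keeps spaces inside the filename.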

Assuming the files all have the .txt extension, you can try:
(?<=(\s)*)\w*\.txt

Why not just use the following regex:
\w+\.\w+

Related

replace part of file name with wrong encoding

Need some guidance on how to solve this one. I have tens of thousands of files in multiple subfolders where the encoding got screwed up. With ls I see a filename displayed as 'F'$'\366''ljesedel.pdf', including the ' at the beginning and end. That's just one example where the Swedish characters åäö went wrong; this one should have been 'Följesedel.pdf'. If I run
#>find .
Then I see a list of files like this:
./F?ljesedel.pdf
Not the same encoding. How on earth do I solve this one? The most obvious ways:
myvar='$'\366''
char="ö"
find . -name *$myvar* -exec rename 's/$myvar/ö' {} \;
and other possible ways fail, since
find . -name cannot find the file: it shows ? instead of the "real" character " '$'\366'' "
Any suggestions or guidance would be very much appreciated.

The first question is what encoding your terminal expects. Make sure that is UTF-8.
Then you need to find what bytes the actual filename contains, not just what something might display it as. You can do this with a perl one-liner like the following, run in the directory containing the file:
perl -E'opendir my $dh, "."; printf "%s: %vX\n", $_, $_ for grep { m/jesedel\.pdf/ } readdir $dh'
This will output the filename interpreted as UTF-8 bytes (if you've set your terminal to that) followed by the hex bytes it actually contains.
Using that you can determine what your search pattern should be. Your replacement must be the UTF-8 encoded representation of ö, which it will be by default as part of the command arguments if your terminal is set to that.
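Once the bytes are known, the rename itself could be sketched in Python, which can work on raw byte filenames. This is a rough sketch under one big assumption: that the broken names are ISO-8859-1 (Latin-1), which fits the \366 byte in the question since 0xF6 is ö in that encoding. The function names are made up, and directory names are not handled.

```python
import os

def repaired_name(raw: bytes) -> bytes:
    """Leave valid UTF-8 alone; otherwise reinterpret the bytes as Latin-1."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode("latin-1").encode("utf-8")

def plan_renames(top: bytes = b"."):
    """Yield (old_path, new_path) byte paths without touching the filesystem."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            fixed = repaired_name(name)
            if fixed != name:
                yield os.path.join(dirpath, name), os.path.join(dirpath, fixed)

# Dry run first, then rename once the plan looks right:
#   for old, new in plan_renames():
#       print(old, "->", new)
#       # os.rename(old, new)
```

Passing a bytes path to os.walk makes it return bytes names, so nothing is decoded behind your back before the check.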

I'm not an expert, but the problem might not be the file name itself (which seems to hold the correct Unicode name) but the way ls (and many other utilities) display the name in the terminal.
I was able to show the correct name by setting the terminal character encoding to Unicode. I also noticed that GUI programs (file manager, etc.) were able to show the correct file name.
Gnome Terminal: Terminal .. Set Character Encoding .. Unicode UTF-8
It is still a challenge to select those files with many utilities (e.g., regex, wildcards). In a few cases, you will have to match those characters with a '*' pattern. If this is a major issue, consider using ASCII only, perhaps 'o' instead of 'ö'. Not sure if this is acceptable.

Why is this vim regex so expensive: s/\n/\\n/g

Attempting this on a sufficiently large file (say 80,000+ lines and about 500k+) will crash things or stall eventually both on my server and on my local Mac.
I've tried this at the command line as well, with the same result:
vim -es -c '%s/\n/\\n/g' -c wq $file
Also, the problem appears to be with the selection (\n) and not the replacement (\\n).
For my larger files I can of course split them and cat them back when finished, but the split points cannot be arbitrary in my case and must be adjusted manually for each and every split.
I appreciate that there are other ways to do this -- sed, etc. -- but I have similar and additional problems there, and I would like to be able to do this with vim.

I'm adding my comment as an answer:
Text editors usually don't like 'gigantic' lines (which is what you'll get with that replacement).
To test whether this is due to the 'big line' and not the substitution itself, I did this test:
I created a simple ~500KB file with a script. No newline characters, just a single line. Then I tried to load the file with vim. Result? I had to kill it :-).
However, if the same script writes a newline every now and then, I have no problem opening the file.
Also, one thing you could try is the following: in vim, replace \n with \n\n (written as :%s/\n/\r\r/, since \r is what inserts a line break in a vim replacement). If that is fast, it should also confirm the 'big line' issue.
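The script used for that test isn't shown, so here is a guess at what such a generator could look like in Python. Everything in it (the function name, the filler character, the default sizes) is an assumption made for illustration.

```python
def make_payload(total_bytes=500_000, line_length=None):
    """Build filler text; line_length=None gives one single gigantic line."""
    chunk = "x" * total_bytes
    if line_length is None:
        return chunk
    # Break the same filler into lines of at most line_length characters.
    lines = [chunk[i:i + line_length] for i in range(0, total_bytes, line_length)]
    return "\n".join(lines) + "\n"

# Example use (file names are arbitrary):
#   open("one_line.txt", "w").write(make_payload())
#   open("many_lines.txt", "w").write(make_payload(line_length=80))
```

Opening both files in vim should show whether the line length alone is what makes it stall.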

bulk file renaming in bash, to remove name with spaces, leaving trailing digits

Can a bash/shell expert help me with this? Each time I use Adobe PDF to split a large PDF file (say its name is X.pdf) into separate pages, where each page is one PDF file, it creates files with this pattern
"X 1.pdf"
"X 2.pdf"
"X 3.pdf" etc...
The file name "X" above is the original file name, which can be anything. It then adds one space after the name, then the page number. Page numbers always start at 1 and run up to the number of pages. There is no option in Adobe PDF to change this.
I need to run a shell command to simply remove/strip out all the "X " part, and just leave the digits, like this
1.pdf
2.pdf
3.pdf
....
100.pdf ...etc..
Not being good at pattern matching, I'm not sure which regular expression I need.
I know I need something like
for i in *.pdf; do mv "$i$" ........; done
And it is the ....... part I do not know how to do.
This only needs to run on Linux/Unix system.

Use sed:
for i in *.pdf; do mv "$i" "$(sed 's/.*[[:blank:]]//' <<< "$i")"; done
It would be simpler with rename:
rename 's/.*\s//' *.pdf

You can remove everything up to (including) the last space in the variable with this:
${i##* }
That's "star space" after the double hash, meaning "anything followed by space". ${i#* } would remove up to the first space.
So run this to check:
for i in *.pdf; do echo mv -i -- "$i" "${i##* }" ; done
and remove the echo if it looks good. The -i suggested by Gordon Davisson will prompt you before overwriting, and -- signifies end of options, which prevents things from blowing up if you ever have filenames starting with -.
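If a script is preferred over a one-liner, the same "drop everything up to the last space" rule could look like this in Python. It is only a sketch: target_name and plan are hypothetical names, and the dry-run approach mirrors the echo trick above.

```python
from pathlib import Path

def target_name(name: str) -> str:
    # Same effect as ${i##* }: drop everything up to and including the last space.
    return name.rsplit(" ", 1)[-1]

def plan(directory="."):
    """Return (old, new) path pairs for *.pdf files whose name would change."""
    pairs = []
    for path in sorted(Path(directory).glob("*.pdf")):
        new = path.with_name(target_name(path.name))
        if new != path:
            pairs.append((path, new))
    return pairs

# Dry run first; rename only when the target does not exist yet:
#   for old, new in plan():
#       print(old, "->", new)
#       # if not new.exists(): old.rename(new)
```

Names without a space are left alone, so running it twice does no harm.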

If you just want to do bulk renaming of files (or directories) and don't mind using external tools, then here's mine: rnm
The command to do what you want would be:
rnm -rs '/.*\s//' *.pdf
.*\s selects the part before (and with) the last white space and replaces it with empty string.
Note:
It doesn't overwrite any existing files (it throws a warning if it finds an existing file with the target name).
The operation can also be undone: you can revert the changes made by the last rnm command with rnm -u.
Here's a list of documents for rnm.

Windows Batch File - Find and return string inside matching pattern

I'm using a batch file to identify and load fonts temporarily. It looks for strings like /FontFamily(Rubber Dinghy Rapids)/ occurring inside .ai and .pdf files.
Now if I do findstr /r FontFamily\(.*\) MyFile.ai, this command returns a hugely interminable line of crap data with FontFamily(Rubber Dinghy Rapids) lost somewhere in there. I ACTUALLY need it to return the value of .* it found inside - in this case Rubber Dinghy Rapids.
Can I do this more elegantly? Or maybe I can switch to using VBScript if it's more elegant there?
My current solution is slow as hell... nested for loops, with one of them delimiting the crap data by the ( character, then finding the line that says FontFamily(Rubber Dinghy Rapids then stripping out the FontFamily( string, leaving me finally with Rubber Dinghy Rapids.

I wrote a hybrid Batch-JScript program called FindRepl.bat that uses JScript's regular expressions to search for strings in a file. Using my program you may solve your problem this way:
FindRepl.bat "FontFamily\((.*)\)" /$:1 < input.txt
You may get my program from this site.
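For comparison, and not as a replacement for the batch tool above, the capture-group idea can be sketched in Python. It assumes the font name never contains a ")" character, which may not hold for every .ai or .pdf file; the function name is made up.

```python
import re

# [^)]* stops at the first ")", so each match stays short even on huge lines.
FONT_RE = re.compile(rb"FontFamily\(([^)]*)\)")

def font_families(data: bytes):
    """Return the distinct names found inside FontFamily(...), sorted."""
    return sorted({name.decode("latin-1") for name in FONT_RE.findall(data)})

# Example use (file name taken from the question):
#   print(font_families(open("MyFile.ai", "rb").read()))
```

Reading the file as bytes avoids decoding errors on the binary parts of the document.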

Delete all lines upto some regex match

I want to delete everything from the start of the document up to some regex match, such as _tmm. I wrote the following custom command:
command! FilterTmm exe 'g/^_tmm\\>/,/^$/mo$' | norm /_tmm<CR> | :0,-1 d
This doesn't work as expected. But when I execute these commands directly using the command line, they work.
Do you have any alternative suggestions to accomplish this job using custom commands?

It seems that you want to remove everything from the beginning up to the line above the matched line.
A /pattern search can take an offset, as in /pattern/{offset} (see :h / for details). For your needs, you could do (no matter where your cursor is):
ggd/_tmm/-1<cr>
EDIT
I read your question again, and it seems that you want to do it in a single command line.
Your script has a problem: normal doesn't support |, that is, it must be the last command.
Try this line and see if it works for you:
exe 'norm gg'|/_tmm/-1|0,.d

We Keep Coding
