replace part of file name with wrong encoding - regex

Need some guidance how to solve this one. Have 10 000s of files in multiple subfolders where the encoding got screwed up. Via ls command I see a filename named like this 'F'$'\366''ljesedel.pdf', that includes the ' at beginning and end. That's just one example where the Swedish characters åäö got wrong, in this example this should have been 'Följesedel.pdf'. If If I run
#>find .
Then I see a list of files like this:
./F?ljesedel.pdf
Not the same encoding. How on earth solving this one? The most obvious ways:
myvar='$'\366''
char="ö"
find . -name *$myvar* -exec rename 's/$myvar/ö' {} \;
and other possible ways fails since
find . -name cannot find it due to the ? instead of the "real" characters " '$'\366'' "
Any suggestions or guidance would be very much appreciated.

The first question is what encoding your terminal expects. Make sure that is UTF-8.
Then you need to find what bytes the actual filename contains, not just what something might display it as. You can do this with a perl oneliner like follows, run in the directory containing the file:
perl -E'opendir my $dh, "."; printf "%s: %vX\n", $_, $_ for grep { m/jesedel\.pdf/ } readdir $dh'
This will output the filename interpreted as UTF-8 bytes (if you've set your terminal to that) followed by the hex bytes it actually contains.
Using that you can determine what your search pattern should be. Your replacement must be the UTF-8 encoded representation of ö, which it will be by default as part of the command arguments if your terminal is set to that.

I'm not an expert - but it might not be a problem with the file name (which seems to hold the correct Unicode file name) - but with the way ls (and many other utilities) show the name to the terminal.
I was able to show the correct name by setting the terminal character encoding to Unicode. Also I've noticed the GUI programs (file manager, etc), were able to show the correct file name.
Gnome Terminal: "Terminal .. set character encoding - Unicode UTF8
It is still a challenge with many utilities to 'select' those files (e.g., REGEXP, wildcard). In few cases, you will have to select those character using '*' pattern. If this is a major issue considering using Ascii only - may be use the 'o' instead of 'ö'. Not sure if this is acceptable.

Related

Iterating over directory with specified path in Bash

pathToBins=$1
bins="${pathToBins}contigs.fa.metabat-bins-*"
for fileName in $bins
do
echo $fileName
done
My goal is to attach a path to my file name. I can iterate over a folder and get the file name when I don't attach the path. My challenge is when I add the path echo fileName my regular expression no longer works and I get "/home/erikrasmussen/Desktop/Script/realLargeMetaBatBinscontigs.fa.metabat-bins-*" where the regular expression '*' is treated like a string. How can I get the path and also the full file name while iterating over a folder of files?
Although I don't really know how your files are arranged on your hard drive, a casual glance at "/home/erikrasmussen/Desktop/Script/realLargeMetaBatBinscontigs.fa.metabat-bins-*" suggests that it is missing a / before contigs. If that is the case, then you should change your definition of bins to:
bins="${pathToBins}/contigs.fa.metabat-bins-*"
However, it is much more robust to use bash arrays instead of relying on filenames to not include whitespace and metacharacters. So I would suggest:
bins=(${pathToBins}/contigs.fa.metabat-bins-*)
for fileName in "${bins[#]}"
do
echo "$fileName"
done
Bash normally does not expand a pattern which doesn't match any file, so in that case you will see the original pattern. If you use the array formulation above, you could set the bash option nullglob, which will cause the unmatched pattern to vanish instead, leaving an empty array.

bulk file renaming in bash, to remove name with spaces, leaving trailing digits

Can a bash/shell expert help me in this? Each time I use PDF to split large pdf file (say its name is X.pdf) into separate pages, where each page is one pdf file, it creates files with this pattern
"X 1.pdf"
"X 2.pdf"
"X 3.pdf" etc...
The file name "X" above is the original file name, which can be anything. It then adds one space after the name, then the page number. Page numbers always start from 1 and up to how many pages. There is no option in adobe PDF to change this.
I need to run a shell command to simply remove/strip out all the "X " part, and just leave the digits, like this
1.pdf
2.pdf
3.pdf
....
100.pdf ...etc..
Not being good in pattern matching, not sure what regular expression I need.
I know I need something like
for i in *.pdf; do mv "$i$" ........; done
And it is the ....... part I do not know how to do.
This only needs to run on Linux/Unix system.
Use sed..
for i in *.pdf; do mv "$i" $(sed 's/.*[[:blank:]]//' <<< "$i"); done
And it would be simple through rename
rename 's/.*\s//' *.pdf
You can remove everything up to (including) the last space in the variable with this:
${i##* }
That's "star space" after the double hash, meaning "anything followed by space". ${i#* } would remove up to the first space.
So run this to check:
for i in *.pdf; do echo mv -i -- "$i" "${i##* }" ; done
and remove the echo if it looks good. The -i suggested by Gordon Davisson will prompt you before overwriting, and -- signifies end of options, which prevents things from blowing up if you ever have filenames starting with -.
If you just want to do bulk renaming of files (or directories) and don't mind using external tools, then here's mine: rnm
The command to do what you want would be:
rnm -rs '/.*\s//' *.pdf
.*\s selects the part before (and with) the last white space and replaces it with empty string.
Note:
It doesn't overwrite any existing files (throws warning if it finds an existing file with the target name).
And this operation is failsafe. You can get back the changes made by last rnm command with rnm -u.
Here's a list of documents for rnm.

Filename extraction with regex

I need to be able to only extract the filename (info.txt) from a line like:
07/01/2010 07:25p 953 info.txt
I've tried using this: /d+\s+\d+\s+\d+\s+(?.?)/, but it doesn't seem to work ...
How about
/\S+$/
I.e. the longest possible string of non-whitespace at the end of the line.
(Hard to know for sure without more info about the possible inputs.)
As #J V pointed out, filenames with spaces in them (like his username) will not be parsed properly by the above regexp. We don't know from the question whether that's possible.
But I have a suspicion that we're looking at the output of Windows DIR command, or something very similar. In that case, the most reliable approach might be just to hack off the first 39 characters and keep the rest:
/^.{39}(.+)$/
Then $1 will contain the filename.
Better option:
But if you are using Windows DIR (as per your new comment), and you can control the DIR command, try
DIR /b
which removes the unneeded cruft (assuming you don't need the date, size etc.) and gives you one filename per line.
OK, you're using a Unix dir (per newer comment). The CentOS dir I have outputs one file per line, nothing else, when you give it no command line options. Chances are very good that whichever dir you're using can be persuaded to output filenames like that... then you wouldn't have to worry about using a regex that may or may not be correct for every possible input. Try man dir or dir --help to find out what command-line options to use.
\d\d:\d\d\w\s+\d+\s+(.*?)$
$1 will be the file name
The problem with your original regex is that it forgets the special characters :, /, and (?.?) means nothing...
Assuming that the files have extension as .txt you can try.
(?<=(\s)*)\w*.txt
Why not just use the following regex:
\w+\.\w+

executing filenames with spaces in cmd pmt Passed from c++ program

I am currently working on getting my program to execute a program (such as power point) and then beside it the path to the file I want to open. My program is getting the file's path by using:
dirIter2->path()
I get the 2 paths of the program and file, Merge them as one string and pass them into the following:
system(PathTotal.c_str())
this is working great but my only issue is that when the file name has a space in its name command prompt says it cannont find the file (becuase it thinks the file name ends when it gets to the first space. I have tried to wrap it with quotes but it is the acutal file name that need to be wrapped.
(eg. i have tried "C:\users\bob\john is cool" but it needs to be like this: C:\users\bob\"john is cool")
Does anyone have any suggestions on how I could fix this? I was thinking about getting the path to the folder to where the file and then getting the file name. I would wrap the file name with quotes then add it to the folder's path. I have tried using the ->path() like above but the only problem is that it only goes to outside of the folder's directory?
Is there a boost command that could get the enitre path to the file without getting the file aswell?
I am not commited to this idea if anyone has any better suggestions
Thanks
In both C and C++, the '\' is an escape character. For certain things (like '\n' or '\t') it inserts a control code; otherwise, it just gives you the next character.
So if you do something like:
fopen("C:\users\bob\john is cool", "r");
it's going to try to open a file named
C:usersbobjohn is cool
If you want those '\' characters in the output, you have to escape them. So you'd want:
fopen("C:\\users\\bob\\john is cool", "r");
On Windows with Visual Studio, I've also successfully used Unix-style separators:
fopen("C:/users/bob/john is cool", "r");
And in fact, you can mix them up:
fopen("C:/users\\bob/john is cool", "r");
I'm not familiar with C string operations, but couldn't you do the following rather easily?
int i = path.lastIndexOf("\\"); //Find the index of the last "\"
String quotedPath = path.substring(0, i+1); //Get the path up until the last "\"
quotedPath += "\"" + path.substring(i+2) + "\""; //Add quotes and concatenate the filename
Sorry for the Java, its the closest thing that I'm familiar with. I've made this a community wiki in case someone can edit the code to the equivalent C.
I'd also like to add that sometimes it is necessary to escape spaces as in the following:
cmd.exe -C C:/Program\ Files/Application\ Folder/Executable\ with\ spaces.exe
or
cmd.exe -C C:\\Program\ Files\\Application\ Folder\\Executable\ with\ spaces.exe

Find Non-UTF8 Filenames on Linux File System

I have a number of files hiding in my LANG=en_US:UTF-8 filesystem that have been uploaded with unrecognisable characters in their filename.
I need to search the filesystem and return all filenames that have at least one character that is not in the standard range (a-zA-Z0-9 and .-_ etc.)
I have been trying to following but no luck.
find . | egrep [^a-zA-Z0-9_\.\/\-\s]
I'm using Fedora Code 9.
convmv might be interesting to you. It doesn't just find those files, but also supports renaming them to correct file names (if it can guess what went wrong).
find . | perl -ne 'print if /[^[:ascii:]]/'
find . | egrep [^a-zA-Z0-9_./-\s]
Danger, shell escaping!
bash will be interpreting that last parameter, removing one level of backslash-escaping. Try putting double quotes around the "[^group]" expression.
Also of course this disallows a lot more than UTF-8. It is possible to construct a regex to match valid UTF-8 strings, but it's rather ugly. If you have Python 2.x available you could take advantage of that:
import os.path
def walk(dir):
for child in os.listdir(dir):
child= os.path.join(dir, child)
if os.path.isdir(child):
for descendant in walk(child):
yield descendant
yield child
for path in walk('.'):
try:
u= unicode(path, 'utf-8')
except UnicodeError:
# print path, or attempt to rename file
I had a similar problem to the OP for which I was given a solution on Superuser (see also further comments) that I found more satisfactory than the "convmv solution", although I appreciate to have discovered comvmv too.