Regex to batch rename files in for loop by only extracting first 5 characters - regex

I have many .xlsx files that look like XXX-A_2016(Final).xlsx and I am trying to write a shell script (bash) that will batch convert each one to csv, but also rename the output file to just "XXX-A.csv", so I think I need a regular expression within my for loop that extracts the first 5 characters of the input string (filename). I have xlsx2csv and I am using the following loop:
for i in *.xlsx;
do
filename=$(basename "$i" .xlsx);
outext=".csv"
xlsx2csv "$i" "$filename$outext"
done
There is a line missing that would take care of the file renaming prior to converting to csv.

You can use:
for i in *.xlsx; do
xlsx2csv "$i" "${i%_*}".csv
done
"${i%_*}" will strip anything after _ at the end of variable $i, giving us XXX-A as a result.

Related

How can I combine multiple text files, remove duplicate lines and split the remaining lines into several files of certain length?

I have a lot of relatively small files with about 350.000 lines of text.
For example:
File 1:
asdf
wetwert
ddghr
vbnd
...
sdfre
File 2:
erye
yren
asdf
jkdt
...
uory
As you can see line 3 of file 2 is a duplicate of line 1 in file 1.
I want a program / Notepad++ Plugin that can check and remove these duplicates in multiple files.
The next problem I have is that I want all lists to be combined into large 1.000.000 line files.
So, for example, I have these files:
648563 lines
375924 lines
487036 lines
I want them to result in these files:
1.000.000 lines
511.523 lines
And the last 2 files must consist of only unique lines.
How can I possibly do this? Can I use some programs for this? Or a combination of multiple Notepad++ Plugins?
I know GSplit can split files of 1.536.243 into files of 1.000.000 and 536.243 lines, but that is not enough, and it doesn't remove duplicates.
I do want to create my own Notepad++ plugin or program if needed, but I have no idea how and where to start.
Thanks in advance.
You have asked about Notepad++ and are thus using Windows. On the other hand, you said you want to create a program if needed, so I guess the main goal is to get the job done.
This answer uses Unix tools - on Windows, you can get those with Cygwin.
To run the commands, you have to type (or paste) them in the terminal / console.
cat file1 file2 file3 | sort -u | split -l1000000 - outfile_
cat reads the files and writes them out, normally to the screen; the pipe | takes the output of the command on its left and feeds it to the command on its right.
sort obviously sorts them, and the switch -u tells it to remove duplicate lines.
The output is then piped to split, which is told to split after 1000000 lines by the switch -l1000000. The - (with spaces around it) tells it to read its input not from a file but from "standard input"; the output of sort -u in this case. The last word, outfile_, can be changed by you, if you want.
Written like it is, this will result in files like outfile_aa, outfile_ab and so on - you can modify this with the last word in this command.
If you have all the files in one directory, and nothing else is in there, you can use * instead of listing all the files:
cat * | sort -u | split -l1000000 - outfile_
If the files might contain empty lines, you might want to remove them. Otherwise, they'll be sorted to the top and your first file will not have the full 1.000.000 values:
cat file1 file2 file3 | grep -v '^\s*$' | sort -u | split -l1000000 - outfile_
This will also remove lines that consist only of whitespace.
grep filters input using regular expressions. -v inverts the filter; normally, grep keeps only lines that match, but with -v it keeps only lines that don't match. ^\s*$ matches lines that consist of nothing but zero or more whitespace characters (like spaces or tabs).
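As a quick sanity check, here is the same filter-and-dedupe stage on a few made-up lines, including a blank line and a duplicate (note that \s is a GNU grep extension; [[:space:]] is the strictly POSIX spelling):

```shell
# the blank line is dropped by grep, the duplicate "b" is collapsed by sort -u
printf 'b\n\na\nb\n' | grep -v '^[[:space:]]*$' | sort -u
# prints:
# a
# b
```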
If you need to do this regularly, you can write a script so you don't have to remember the details:
#!/bin/sh
cat * | sort -u | split -l1000000 - outfile_
Save this as a file (for example combine.sh), make it executable with chmod +x combine.sh, and run it with
./combine.sh

Mass rename in shell script

I have a bunch of files which are of this format:
blabla.log.YYYY.MM.DD
Where YYYY.MM.DD is something like (2016.01.18)
I have quite a few folders with about 1000 files in each, so I wanted to have a simple script to rename them. I want to rename them to
blabla.log
So basically, I'm just stripping the date at the end. Here is what I have:
for f in [a-zA-Z]*.log.[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]; do
mv -v $f ${f#[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]};
done
This script outputs this:
mv: `blabla.log.2016.01.18' and `blabla.log.2016.01.18' are the same file
For more information:
I'm on windows, but I run this script in gitbash
For some reason, my gitbash doesn't recognize the "rename" command
Some regex patterns (like [0-9]{4} don't seem to work)
I'm really at a loss. Thanks.
EDIT: I need to rename every single file that has a date at the end and that is of the form *.log.2016.01.18. They all need to keep their original names. All that should change is the removal of the date.
You have to use % instead of #: you want to remove from the end, not the start of your string.
Also, you're missing a . in what has to be removed, you don't want to end up with blabla.log..
Quoting the variable names prevents surprises when file names contain special characters.
Together:
mv -v "$f" "${f%.[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]}"

how to retrieve filename or extension within bash [duplicate]

This question already has answers here:
Extract filename and extension in Bash
(38 answers)
Closed 8 years ago.
i have a script that is pushing out some filesystem data to be uploaded to another system.
it would be very handy if i could tell myself what 'kind' of file each file actually is, because it will help with some querying later on down the road.
so, for example, say that my script is spitting out the following:
/home/myuser/mydata/myfile/data.log
/home/myuser/mydata/myfile/myfile.gz
/home/myuser/mydata/myfile/mod.conf
/home/myuser/mydata/myfile/security
/home/myuser/mydata/myfile/last
in the end, i'd like to see:
/home/myuser/mydata/myfile/data.log log
/home/myuser/mydata/myfile/myfile.gz gz
/home/myuser/mydata/myfile/mod.conf conf
/home/myuser/mydata/myfile/security security
/home/myuser/mydata/myfile/last last
there's gotta be a way to do this with regular expressions and sed, but i can't figure it out.
any suggestions?
EDIT:
i need to get this info via the command line. looking at the answers so far, i obviously have not made this clear. so with the example data i provided, assume that data is all being fed via greps and seds (data is already sterilized). i need to be able to pipe the example data to sed/grep/awk/whatever in order to produce the desired results.
Print the last field, where fields are separated by any non-alphabetic character.
awk -F '[^[:alpha:]]' '{ print $0,$NF }'
/home/myuser/mydata/myfile/data.log log
/home/myuser/mydata/myfile/myfile.gz gz
/home/myuser/mydata/myfile/mod.conf conf
/home/myuser/mydata/myfile/security security
/home/myuser/mydata/myfile/last last
This should work for you:
x='/home/myuser/mydata/myfile/security'
( IFS='/.' && arr=( $x ) && echo "${arr[@]:(-1):1}" )
security
x='/home/myuser/mydata/myfile/data.log'
( IFS='/.' && arr=( $x ) && echo "${arr[@]:(-1):1}" )
log
To extract the last element in a filename path:
filename=${path##*/}
To extract characters after a dot in a filename:
extension=${filename##*.}
But (my comment) rather than looking at the extension, it might be better to use file. See man file.
As others have already answered, to parse the file names:
extension="${full_file_name##*.}" # BASH and Kornshell/POSIX only
filename=$(basename "$full_file_name")
dirname=$(dirname "$full_file_name")
Quotes are needed if file names could have spaces, tabs, or other strange characters in them.
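For example, with one of the paths from the question:

```shell
full_file_name='/home/myuser/mydata/myfile/data.log'
echo "${full_file_name##*.}"   # log  (the extension)
basename "$full_file_name"     # data.log
dirname "$full_file_name"      # /home/myuser/mydata/myfile
```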
You can also test whether a file is a directory, regular file, or link with the test command (which is linked to [, so that test -f foo is the same as [ -f foo ]).
However, you said: "it would be very handy if i could tell myself what kind of file each file actually is".
In that case, you may want to investigate the file command. This command returns the file type as determined by some sort of magic file (traditionally in /etc/magic), but newer implementations can use the user's own scheme. It can tell the file type by extension and by the magic number in the file's header, or by looking at the first few lines of the file (for example, matching a regular expression like ^#! .*/bash$ in the first line).
This extracts the last component after a slash or a dot.
awk -F '[/.]' '{ print $NF }'
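Fed one of the example paths, this prints only the part after the last slash or dot:

```shell
# -F '[/.]' splits on every "/" and "."; $NF is the last resulting field
echo '/home/myuser/mydata/myfile/data.log' | awk -F '[/.]' '{ print $NF }'
# log
```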

awk script to remove ASCII from file type

Here is a simple command
file * | awk '/ASCII text/ {gsub(/:/,"",$1); print $1}' | xargs chmod -x
I am not able to understand the use of awk in the above as showed.
How is it working?
There was a deleted answer which came pretty close to avoiding the problems with whitespace or colons in filenames and the output of file. I've voted to undelete the answer, but I'm going to go ahead and post some improvements to it and add some explanation.
file -0 * | awk -F '\0' '$2 ~ /ASCII text/ {print $1 "\0"}' | xargs -0 chmod -x
Since nulls aren't allowed in filenames, it's safe to use them as delimiters. Each step in this pipeline uses nulls. file outputs them, awk accepts them in input and outputs them and xargs accepts them in input. I've also made the match specific to the description field so it won't trigger a false positive in the perhaps unusual case of a file which is named something like "ASCII text" but in fact its contents are not.
As others have said, the AWK command you posted matches lines of output from the file command that include "ASCII text" somewhere in the line. Then every colon is deleted (since gsub() is a global substitution) from field one which is the colon-space-delimited filename. A potential problem occurs if the filename contains either a colon or a space (or both or multiples). The filename will get truncated and the chmod will fail or might even be falsely triggered on a file with a similar name (e.g. "foo bar" and "foo" both exist, "foo" is not an ASCII text file so you don't want it to be touched, but "foo bar" gets truncated to "foo" and oops!). The reason spaces are potential problems is that AWK, by default, does field splitting on spaces and tabs.
Breakdown of the AWK portion of the pipeline you posted:
/ASCII text/ { - for each line that matches the regular expression
gsub(/:/,"",$1); - for each colon (as a regular expression) in the first field, substitute an empty string
print $1} - print the thus modified first field
I'm guessing but it looks like it's extracting the part before the : in the output of the file command (i.e. the filename). The gsub part will remove the : in the filename and so something like foo.txt: ASCII text will become foo.txt ASCII text. Then, the print will print the first item in the space separated list (in this case, the filename foo.txt). All these files will be made unexecutable by the chmod.
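To see just the awk stage in action, here it is on a single made-up line in the shape that file produces:

```shell
# the line matches /ASCII text/, gsub strips the colon from $1, and $1 is printed
printf 'foo.txt: ASCII text\n' | awk '/ASCII text/ {gsub(/:/,"",$1); print $1}'
# foo.txt
```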
This looks quite tedious. It's probably easier to just say awk -F: '{print $1}' after grepping instead of the whole substitution trick. Also, this will break if the filename has spaces in it.
It's using file to determine the type (contents) of each file, then selecting the ones that are ASCII text and removing everything from the first colon (which is assumed to be the separator between the filename and file type; this is fragile when file names have colons in them; as Noufel noted, it's also doing it the hard way), then using xargs to batch them up and clear the execute bits. (The usual reason for doing this is files transferred from Windows, which doesn't have execute bits, so often all files end up with execute bits set as seen by Unixes.)
The breakage on spaces is fixable; xargs understands quoting. I would break on the last colon instead of the first, though, since file doesn't usually include colons in its ASCII text type strings.

What Vim command to use to delete all text after a certain character on every line of a file?

Scenario:
I have a text file that has pipe (as in the | character) delimited data.
Each field of data in the pipe delimited fields can be of variable length, so counting characters won't work (or using some sort of substring function... if that even exists in Vim).
Is it possible, using Vim to delete all data from the second pipe to the end of the line for the entire file? There are approx 150,000 lines, so doing this manually would only be appealing to a masochist...
For example, change the following lines from:
1111|random sized text 12345|more random data la la la|1111|abcde
2222|random sized text abcdefghijk|la la la la|2222|defgh
3333|random sized text|more random data|33333|ijklmnop
to:
1111|random sized text 12345
2222|random sized text abcdefghijk
3333|random sized text
I'm sure this can be done somehow... I hope.
UPDATE: I should have mentioned that I'm running this on Windows XP, so I don't have access to some of the mentioned *nix commands (cut is not recognized on Windows).
:%s/^\v([^|]+\|[^|]+)\|.*$/\1/
You can also record a macro:
qq02f|Djq
and then you will be able to play it with 100@q to run the macro on the next 100 lines.
Macro explanation:
qq: starts macro recording;
0: goes to the first character of the line;
2f|: finds the second occurrence of the | character on the line;
D: deletes from the cursor position (the second | itself) to the end of the line;
j: goes to the next line;
q: ends macro recording.
If you don't have to use Vim, another alternative would be the unix cut command:
cut -d '|' -f 1-2 file > out.file
Instead of substitution, one can use the :normal command to repeat
a sequence of two Normal mode commands on each line: 2f|, jumping
to the second | character on the line, and then D, deleting
everything up to the end of line.
:%norm!2f|D
Just another Vim way to do the same thing:
%s/^\(.\{-}|\)\{2}\zs.*//
%s/^\(.\{-}\zs|\)\{2}.*// " If you want to remove the second pipe as well.
This time, the regex matches as few characters as possible (\{-}) followed by a |, twice (\{2}); \zs marks where the replaced part of the match begins, so everything before it is kept and the rest of the line is replaced by nothing (//).
You can use :command to make a user command to run the substitution:
:command -range=% YourNameHere <line1>,<line2>s/^\v([^|]+\|[^|]+)\|.*$/\1/
You can also do:
:%s/^\([^|]\+|[^|]\+\)|.*$/\1/
Use Awk:
awk -F"|" '{$0=$1"|"$2}1' file
I've found that vim isn't great at handling very large files. I'm not sure how large your file is. Maybe cat and sed together would work better.
Here is a sed solution:
sed -e 's/^\([^|]*|[^|]*\).*$/\1/'
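Applied to the first sample line from the question:

```shell
# the capture group grabs everything up to (not including) the second "|",
# and the whole line is replaced by just that group
printf '1111|random sized text 12345|more random data la la la|1111|abcde\n' \
  | sed -e 's/^\([^|]*|[^|]*\).*$/\1/'
# 1111|random sized text 12345
```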
This will filter all lines in the buffer (1,$) through cut to do the job:
:1,$!cut -d '|' -f 1-2
To do it only on the current line, try:
:.!cut -d '|' -f 1-2
Why use Vim? Why not just run
cat my_pipe_file | cut -d'|' -f1-2