Replace a list of values in a text file

I have the following problem:
I have a huge file and I have to replace several values (more than one).
For example, I have to replace:
DOG with RED
CAT with BLUE
FISH with GREEN
...
...
n with N
Do you know of some software that, given a list of values as input, can replace all the values of the list in the text in one go?
EDIT:
My text file is really big, like a book or similar.
In this book I have many words that I have to replace with other words.

You can use sed to substitute matching expressions, for example
sed -e 's/DOG/RED/g;s/CAT/BLUE/g' < inputFile > outputFile
You haven't specified whether you want this changed in place or not. You could simply delete the old version afterwards if you were happy with the results.
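If the list of replacements is long, you can generate the sed script instead of typing each substitution by hand. A minimal sketch (the file names pairs.txt and replacements.sed are my own placeholders, and it assumes the values contain no characters special to sed, such as / or &):
while read -r old new; do
  printf 's/%s/%s/g\n' "$old" "$new"
done < pairs.txt > replacements.sed
sed -f replacements.sed < inputFile > outputFile
Each line of pairs.txt holds one pair, e.g. DOG RED.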
If you are on Windows, other answers give you an equivalent or suggest tools such as Cygwin: e.g. here

Related

Issues while processing zeroes found in CSV input file with Perl

Friends:
I have to process a CSV file using Perl and produce an Excel file as output, using the Excel::Writer::XLSX module. This is not homework but a real-life problem, where I cannot download whichever Perl version I like (I actually need to use Perl 5.6) or whichever Perl modules (I have a limited set of them). My OS is UNIX. I can also use (embedded in Perl) ksh and csh (with some limitations, as I have found so far). Please limit your answers to the tools I have available. Thanks in advance!
Even though I am not a Perl developer, coming instead from other languages, I have already done my work. However, the customer is asking for extra processing, and that is where I am getting stuck.
1) The stones in the road come from two sides: from Perl's and from Excel's particular styles of processing data. I have already found a workaround to handle the Excel side, but, as mentioned in the subject, I have difficulties processing zeroes found in the CSV input file. To handle Excel, I am using the '0 approach, which is the final representation for data that Excel seems to use with the # formatting style.
2) Scenario:
I need to catch standalone zeroes which might be present in whichever line / column / cell of the CSV input file and put them as such (as zeroes) in the Excel output file.
I will go directly to the point of my question to avoid losing your valuable time. I provide more details after the question:
Research and question:
I tried to use a Perl regex to find standalone "0" values and replace them with an arbitrary string, planning to replace them back with "0" at the end of processing.
perl -p -i -e 's/\b0\b/string/g' myfile.csv
and
perl -i -ple 's/\b0\b/string/g' myfile.csv
Both are working, but only from the command line. They aren't working when I call them from the Perl script as follows:
system("perl -i -ple 's/\b0\b/string/g' myfile.csv")
I do not know why... I have already tried using exec and eval instead of system, with the same results.
Note that I have a ton of regexes with the same structure that work perfectly, such as the following:
system("perl -i -ple 's/input/output/g' myfile.csv")
I have also tried using backticks and qx//, without success. Note that qx// and backticks do not behave the same, since qx// complains about the \b boundaries because of the forward slash.
I have tried using sed -i, but my system rejects -i as an invalid flag (I do not know whether this happens on every UNIX, but at least it happens on the one at work; it does, however, accept perl -i).
I have tried embedding awk (which works from the command line), in this way:
system `awk -F ',' -v OFS=',' '$1 == "0" { $1 = "string" }1' myfile.csv > myfile_copy.csv`
But this works only for the first column (on the command line) and, besides the disadvantage of needing an extra copy of the file, Perl complains about the > redirection, taking it as "greater than"...
system(q#awk 'BEGIN{FS=OFS=",";split("1 2 3 4 5",A," ") } { for(i in A)sub(0,"string",$A[i] ) }1' myfile.csv#);
This awk works from the command line, but only for 5 columns, and it does not work in Perl using the q# # quoting.
All the combinations of exec and eval have also been tested without success.
I have also tried passing each of the awk components to system as separate comma-separated arguments, but did not find any valid way to pass the redirector (>), since Perl rejects it for the reason mentioned above.
Using another approach, I noticed that the "standalone zeroes" seem to be "swallowed" by the Text::CSV module, so I got rid of it and went back to a traditional line-by-line loop over the CSV with a split on commas, preserving the zeroes that way. However, I then found the "mystery" of isdual in Perl, and because of the limited set of modules I have, I cannot use Dumper. I also explored the guts of Perl's internals and tried the $x ^ $x trick, which was deprecated in version 5.22 but is valid up to that version (as I said, mine is 5.6). It is useful for telling numbers from strings. However, while if( $x ^ $x ) returns TRUE for strings, if( !( $x ^ $x ) ) does not return TRUE when $x = 0. [UPDATE: I tried this in a dedicated Perl script, written just for this purpose, and it works. I believe my probably wrong conclusion ("not returning TRUE") was reached before I realized that Text::CSV was swallowing my zeroes. Doing new tests...]
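A quick way to see the distinction from the shell (a sketch; I am assuming Perl 5.6 behaves like later perls for these two cases). Perl picks string or numeric XOR from the operand's internal type, so the string "0" XORed with itself yields the one-byte string "\0", which is true in boolean context, while the number 0 yields 0, which is false:
perl -le '$x = "0"; print(($x ^ $x) ? "string" : "number")'   # prints "string"
perl -le '$x = 0;   print(($x ^ $x) ? "string" : "number")'   # prints "number"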
I will appreciate very much your help!
MORE DETAILS ON MY REQUIREMENTS:
1) This is a dynamic report coming from a database, which is handed over to me and which I pick up programmatically from a folder. Dynamic means that it might have any number of tables, any number of columns in each table, any names as column headers, and any number of rows in each table.
2) I do not know, and cannot know, the column names, because they vary from report to report. So, I cannot be guided by column names.
A sample input:
Alfa,Alfa1,Beta,Gamma,Delta,Delta1,Epsilon,Dseta,Heta,Zeta,Iota,Kappa
0,J5,alfa,0,111.33,124.45,0,0,456.85,234.56,798.43,330000.00
M1,0,X888,ZZ,222.44,111.33,12.24,45.67,0,234.56,0,975.33
3) Input Explanation
a) This is an example of a random report with 12 columns and 3 rows. The first row is the header.
b) I call "standalone zeroes" those "clean" zeroes that come in the CSV file, from the second row onwards, between commas: like 0, (if it is the first position in the row) or like ,0, in subsequent positions.
c) In the second row of the example you can read, from the beginning of the row: 0,J5,alfa,0, which in this particular case are "words" or "strings"; in this case, 4 names (note that two of them are zeroes, which need to be treated as strings). Thus we have an example with 4 name columns (Alfa,Alfa1,Beta,Gamma are the headers for those columns, but only in this scenario). From that point onwards in the second row you see floating-point (*.00) numbers and, among them, 2 zeroes, which are numbers. Finally, in the third row, you can read M1,0,X888,ZZ, which are the names for the first 4 columns. Note, please, that the 4th column in the second row has 0 as its name, while the 4th column in the third row has ZZ as its name.
Summary: as a general picture, I have a table-report divided into 2 parts, from left to right: 4 columns for names and 8 columns for numbers.
The first M columns are always names and the last N columns are always numbers.
- It is unknown what M is: how many columns devoted to words / strings I will receive.
- It is unknown what N is: how many columns devoted to numbers I will receive.
- It IS known that, after the M name columns end, the N number columns always start, and this is constant for all the rows.
I have done some quick research on Perl's regex boundaries ( \b ), and I have not found any relevant information on whether or not they apply in Perl 5.6.
However, since you are using an old Perl version, try the traditional UNIX / Linux style (I mean, what Perl inherits from the shell), like this:
system("perl -i -ple 's/^0/string/g' myfile.csv");
The previous regex should do the work, making the change at the start of each line in your CSV file, if it matches.
Or, maybe better (if you have those "standalone" zeroes and want to avoid any unwanted change to some "leading zeroes" string):
system("perl -i -ple 's/^0,/string,/g' myfile.csv");
[Note that I have added the comma after the zero and, of course, after the string.]
Note that the first regex should work; the second one is just a caveat, to be cautious.
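As a side note on why the original system() call may have failed (my own guess, not something confirmed in the thread): inside a double-quoted Perl string, \b is the escape for a backspace character, so system("perl ... 's/\b0\b/string/g' ...") hands the child perl a pattern containing literal backspaces instead of word boundaries. You can watch the interpolation happen from the shell:
# each \b in the double-quoted string becomes a real backspace (shown as \b by od):
perl -e 'print "s/\b0\b/string/g"' | od -c
# doubling the backslashes preserves a literal \b for the child process, so
# from inside a Perl script this variant should keep the word boundary:
# system("perl -i -ple 's/\\b0\\b/string/g' myfile.csv");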

How to combine regex and find properly in a shell script

I am trying to write a shell script which can take numbers from a text document and use these numbers to search for all pictures that include the numbers in their name.
I am working with find, and I got it to kind of work. If the name of the picture is exactly the same as the name in the text document, or if the picture's name ends with whatever number is written in the text document, it works. But if the number is in the middle of the picture's name, it doesn't find it. So I have been trying to add a regex to my find command, but I haven't been successful.
input="/Users/unix/Desktop/pictures.txt"
input_2="/Users/unix/Desktop/2019/05/23"
while IFS= read -r -u3 line
do
find "$input_2" -iregex ".*${line}*.jpg"
done 3< "$input"
For example, if the picture's name is Right.jpg and my pictures.txt contains Right, it will find the file. If the picture is called leftRight.jpg, it will also find the file. But if it's something like leftRightleft.jpg, it won't find the picture, so I am a bit confused about how to use regex properly here.
Your regex is simply incorrect. If you break it down, it makes intuitive sense why:
.*${line}*.jpg
means:
.* -- any character repeated 0 or more times
${line}* -- the contents of ${line}, with the last character repeated 0 or more times
. -- any single character
jpg -- the literal characters jpg
So with your example, if you have Right in your file, you'd match actual files like these, which you probably don't want to match:
leftRigh.jpg
leftRighXjpg
leftRighttttttttt.jpg
leftRighttttttttttttttttttttttttttttttttjpg
What you probably want is:
.*${line}.*\.jpg
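For completeness, the loop from the question with the corrected pattern (a sketch only, using the asker's paths):
input="/Users/unix/Desktop/pictures.txt"
input_2="/Users/unix/Desktop/2019/05/23"
while IFS= read -r -u3 line
do
find "$input_2" -iregex ".*${line}.*\.jpg"
done 3< "$input"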

Using Sed or Script to Inline Edit Values in Data Files With Variable Spacing

I have a number of scripts that replace variables separated by whitespace.
e.g.
sed -i 's/old/new/g' filename.conf
But say I have
#NAME   Weight    Age   Name
Boss    160.000   43    BOB
The data below is more readable if it stays within the current alignment, so to speak. So if I'm writing a new double, I'd like to overwrite only the width of each of the fields.
My questions are:
1. How do I capture the patterns between values to preserve spaces?
2. Does sed feature a way to force a shell variable, say ${FOOBAR}, to be a certain width?
3a. If so, how do I define this replacement field width?
3b. If not, what program on Linux is best suited for this truncation, assuming I use a mix of number and string data?
EDIT 1
Let me give a couple more examples.
Let's say my file is:
#informative info on this config var.
VAR1    131            comment        second_comment
#more informative info
VAR2    3.4            13132          yet_another_comment
#FOO    THE VALUE      WARNING
Foo     5.6            donteditthis_comment
#BAR    ANOTHER VALUE  WARNING
Bar     6.5            donteditthis_comment
#Yet another informative comment
VAR3    321
in my bash script I have:
#!/bin/bash
#Vars -- real script will have vals in arrays as
#multiple identically tagged config files will be altered
FOO='Foo'
BAR='Bar'
FOO_VAL_NEW='33.3333'
BAR_VAL_NEW='22.1111'
FILENAME='file.conf'
#Define sed patterns
#These could be inline, but are defined here for readability as they're long.
FOO_MATCH=${FOO}<...whatever special character can be used to capture whitespace...>'[0-9]*.*[0-9]*'
FOO_REPLACE=${FOO}<...whatever special characters to output captured whitespace...>${FOO_VAL_NEW}
BAR_MATCH=${BAR}<...whatever special character can be used to capture whitespace...>'[0-9]*.*[0-9]*'
BAR_REPLACE=${BAR}<...whatever special characters to output captured whitespace...>${BAR_VAL_NEW}
#Do the inline edit ... will be in a loop to handle multiple
#identically tagged config files in full-fledged script.
sed -i "s/${FOO_MATCH}/${FOO_REPLACE}/g" ${FILENAME}
sed -i "s/${BAR_MATCH}/${BAR_REPLACE}/g" ${FILENAME}
My expected output is:
#informative info on this config var.
VAR1    131            comment        second_comment
#more informative info
VAR2    3.4            13132          yet_another_comment
#FOO    THE VALUE      WARNING
Foo     33.3333        donteditthis_comment
#BAR    ANOTHER VALUE  WARNING
Bar     22.1111        donteditthis_comment
#Yet another informative comment
VAR3    321
Currently my script works... but there are a couple of annoyances/dangers.
PROBLEM 1
Currently, to match the tag, I include the exact whitespace characters after it. E.g., for the given example I would define
FOO='Foo     '
...as I'm unsure how to capture whitespace characters and then output them in the replace field.
This is nice for me, as I know I'm going to keep the spacing up to the first field the same, to maintain readability. But if one of my users (this is for a public project) writes their own file like this:
#FOO THE VALUE WARNING
Foo 22.0
Now my script is broken for them. I need to capture the whitespace characters in my match pattern, then output them in my output pattern. That way it will play nicely with my file (optimally spaced for readability), but if someone wants to muck things up and not space things nicely, it will still work for them as well, preserving their current spaces.
PROBLEM 2
Okay, so we've read a tag and, based on what we found with a regex in the match, injected a consistent amount of space after it in the replacement.
But now I need to replace fields within the string.
Currently my script does this. However, it doesn't produce the clean style I show above as my desired output. For the above script, for example, I'd get:
#informative info on this config var.
VAR1    131            comment        second_comment
#more informative info
VAR2    3.4            13132          yet_another_comment
#FOO    THE VALUE      WARNING
Foo 33.3333 donteditthis_comment
#BAR    ANOTHER VALUE  WARNING
Bar 22.1111 donteditthis_comment
#Yet another informative comment
VAR3    321
Well, the values are right, but all that work for readability is ruined... argghhh. Now, if I opened the file in emacs and pressed the insert key, I would be able to arrow over to the '3' in the Foo-tagged value, start typing the new value, and get the output file I listed as desired. I want my sed inline edit to do the same thing... (Maybe, as Kent showed, this is possible with column?)
I want it to overwrite only on the trailing end. Further, I want the next field (let's say I do end up editing the warning) to start at the same column it started at in the old file.
Put more simply, I want a variant of sed -i "s/${MATCH}/${REPLACE}/g" ${FILENAME} that writes replacement variables to a tagged line, starting at the same column that the entry occupies in the CURRENT version of the config file.
This requires both saving the spaces and somehow writing only on the trailing end, padding the output so that the next entry keeps the same starting column if my new value's string is shorter than the old one.
To improve on my current solution, it is crucial both to maintain the starting column of each piece of data in a tagged entry and to be able to match a tag with an arbitrary amount of trailing whitespace (which must be preserved). These are trivial operations in a text editor (see the emacs example above) with the help of the insert key, but more complicated in the script scenario.
This way:
1. I make sure the values can be written no matter how other users space their files.
2. If users (like myself) do bother to match the fields column-wise to the comment above, to improve readability, the script won't mess this up, as it writes only on the trailing side.
Let me know if this is unclear at all.
If this can't be done, or is overly onerous with sed alone, I'd be open to an efficient Perl or Python subscript that my bash script would call, although obviously an inline solution (if concise and understandable) is preferable, if possible.
The column command may help you; see the example below if it is what you are looking for:
kent$ cat f
#NAME Weight Age Name
Boss 160.000 43 BOB
kent$ sed 's/160.000/7.0/' f|column -t
#NAME  Weight  Age  Name
Boss   7.0     43   BOB
kent$ sed 's/160.000/7.7777777777/' f|column -t
#NAME  Weight        Age  Name
Boss   7.7777777777  43   BOB
Using one of your sample datasets, you can get
$ doit Weight 160 7.555555 <<\EOD
#NAME Weight Age Name
Boss 160.000 43 BOB
Me 180 25 JAKE
EOD
#NAME  Weight    Age  Name
Boss   7.555555  43   BOB
Me     180       25   JAKE
$
with this function:
$ doit ()
{
awk -v tag=$1 -v old=$2 -v new=$3 '
NR==1 { for (i=0;i++<NF;) field[$i]=i } # read column headers
$field[tag] == old {
$field[tag] = new
}
{print}
' | column -t
}
the useful part being the loading of the column headers into the field-name -> column map. With tag being "Weight", field[tag] evaluates to 2 for this input, so $field[tag] is $2, i.e. the second field: the Weight column.
To answer your questions as asked:
My questions are:
How do I capture the patterns between values to preserve spaces?
Because of what Kent pointed out, it's probably best to regenerate spacing correct for the new data. If you really need to preserve the exact input spacing wherever possible, forcing lines with replacement values to have different alignment for some values, I'd say ask that again as a separate "no, really, help me here" question. (For the bare mechanics of capturing whitespace, see the sketch after these answers.)
Does sed feature a way to force a shell variable say ${FOOBAR} to be a certain width?
sed is Turing-complete, but that's as close as it gets to such a feature. Sardonic humor aside, the only correct answer here is "no".
3b. If not what program in Linux is best suited for this truncation assuming I use a mix of number and string data?
Kent got that one. I didn't know about column; I get questions answered here that I didn't even know to ask. For locating and substituting the values, awk should do you just fine.
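For question 1's mechanics, here is a capture-group sketch (my own addition, assuming GNU sed with -E for extended regexes): the run of whitespace after the tag is captured and re-emitted, so the leading alignment survives however a user spaced the file. Note it preserves only the whitespace before the value; it does not pad the trailing side, so later fields can still shift if the new value is a different width:
FOO_VAL_NEW='33.3333'
# \1 holds the tag, \2 the captured whitespace run; only the value is rewritten.
sed -i -E "s/^(Foo)([[:space:]]+)[0-9.]+/\1\2${FOO_VAL_NEW}/" file.conf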

bash rename using regex array substitution

I have a very similar question to this post.
I would like to know how to rename occurrences within a filename with designated substitutions. For example, if the original file is called 'the quick brown quick brown fox.avi', I would like to rename it to 'the slow red slow red fox.avi'.
I tried this:
new="(quick=>'slow',brown=>'red')"
regex="quick|brown"
rename -v "s/($regex)/$new{$1}/g" *
but no love :(
I also tried with
regex="qr/quick|brown/"
but this just gives errors. Any idea what I'm doing wrong?
Based on your example, I think you want multiple substitutions (not just converting "quick brown" to "slow red", but converting a list of words to a list of new words). You can separate the substitutions with a semicolon. Here's a solution that works for your example:
rename -v 's/quick/slow/g;s/brown/red/g' *
And if you're really bent on using an array to map the old strings to the new strings, you can cram even more Perl into the argument to rename (but at some point you might as well write the Perl as a stand-alone script):
rename -v '%::new=(quick=>"slow",brown=>"red");s/(quick|brown)/$::new{$1}/g' *
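Following that last thought, a stand-alone sketch of the same idea (my own illustration; only the word pairs come from the question), doing the substitution in a shell loop instead of packing everything into rename's argument:
for f in *; do
  new=$(printf '%s\n' "$f" | perl -pe 'BEGIN { %h = (quick => "slow", brown => "red") } s/(quick|brown)/$h{$1}/g')
  [ "$f" != "$new" ] && mv -v -- "$f" "$new"
done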

Text editor that searches within search results

Does anyone know of a text editor that searches within search results using regex?
I would like to perform a regex search on several text files, get a list of matches, and then apply another regex search to those results to narrow them down further. I would prefer a Windows GUI editor rather than a specialized editor with a steeper learning curve like Vim or Emacs.
You might want to look at PowerGrep. It's not exactly a text editor, but you can open files containing your search results within its built-in text editor, and edit stuff there.
The main thing, though, is that it allows you to search using a regex (or a list of regexes), then apply an additional regex to each search result before returning a 'final' result, which I believe is what you are asking for. It's kind of hard to explain, but maybe you get the idea.
The only problem with PowerGrep is that its UI is not very good. To say it takes some getting used to is an understatement. But once you figure it out, you can do a lot of powerful stuff (search/replace, data collection, etc. on multiple files, whose file names can also be regexes).
The companion product EditPadPro, by the same company, is also a great editor with a really good regex engine built in (probably the same one as in PowerGrep), but it doesn't let you do the 'regex-applied-to-a-regex-result' that I think you are asking for.
Do you want a list of files in which the text matches both regexps, or a list of lines?
In the first case you can do:
{ grep -l -R 'pattern1' * ; grep -l -R 'pattern2' * ; } | sort | uniq -d
Note that on Windows you can get those binaries from GnuWin32 and use nearly the same syntax in a batch file:
( grep -l -R "pattern1" *
grep -l -R "pattern2" *
) | sort | uniq -d
In the latter case, with vim, you can use my answer about narrowing quickfix results with a regexp.
Of course you can also copy your search results to a buffer and do some linewise filtering.
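Outside vim, the line-oriented case also has a simple command-line equivalent, in the same spirit as the file-oriented pipeline above (a sketch; note the second grep also sees the file-name prefix that grep -n prints):
grep -R -n 'pattern1' * | grep 'pattern2'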