awk: Either modify or append a line, based on its existence - regex

I have a small awk script that does some in-place file modifications (to a Java .properties file, to give you an idea). This is part of a deployment script affecting a bunch of users.
I want to be able to set defaults, leaving the rest of the file at the user's preferences. This means appending a configuration line if it is missing, modifying it if it is there, leaving everything else as it is.
Currently I use something like this:
# initialize
BEGIN {
some_value_set = 0
other_value_set = 0
some_value_default = "some.value=SOME VALUE"
other_value_default = "other.value=OTHER VALUE"
}
# modify existing lines
{
if (/^some\.value=.*/)
{
gsub(/.*/, some_value_default)
some_value_set = 1
}
else if (/^other\.value=.*/)
{
gsub(/.*/, other_value_default)
other_value_set = 1
}
print $0
}
# append missing lines
END {
if (some_value_set == 0) print some_value_default
if (other_value_set == 0) print other_value_default
}
Especially when the number of lines I want to control gets larger, this is increasingly cumbersome. My awk knowledge is not all that great, and the above just feels wrong - how can I streamline this?
P.S.: If possible, I'd like to stay with awk. Please don't just recommend that using Perl/Python/whatever would be much easier. :-)

BEGIN {
defaults["some.value"] = "SOME VALUE"
defaults["other.value"] = "OTHER VALUE"
}
{
for (key in defaults) {
pattern = key
gsub(/\./, "\\.", pattern)
if (match($0, "^" pattern "=.*")) {
gsub(/=.*/, "=" defaults[key])
delete defaults[key]
}
}
print $0
}
END {
for (key in defaults) {
print key "=" defaults[key]
}
}

My AWK is rusty, so I won't provide actual code.
Initialize an array with the regular expressions and values.
For each line, iterate the array and do appropriate substitutions. Clean out used entries.
At end, iterate the array and append lines for remaining entries.

Related

How to remove newlines inside csv cells using regex/terminal tools?

I have a csv file where some of the cells have newline character inside. For example:
id,name
01,"this is
with newline"
02,no newline
I want to remove all the newline characters inside cells.
How to do it with regex or with other terminal tools generically without knowing number of columns in advance?
This is actually a harder problem than it looks, and in my opinion, means that regex isn't the right solution. Because you're dealing with quoting/escaped strings, spanning multiple 'lines' you end up with a complicated and difficult to read regex. (It's not impossible, it's just messy).
I would suggest instead - use a parser. Perl has one in Text::CSV and it goes a bit like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new( { binary => 1, eol => "\n" } );
while ( my $row = $csv->getline( \*ARGV ) ) {
s/\n/ /g for @$row;
$csv->print( \*STDOUT, $row );
}
This will take files as piped in/specified on command line - that's what \*ARGV does - it's a special file handle that lets you do ... basically what sed does:
somecommand.sh | myscript.pl
myscript.pl filename_to_process
The ARGV filehandle does either automagically. (You could explicitly open a file or use \*STDIN if you prefer.)
I suspect that instead of removing the newline you actually want to replace it with a space. If your input file is as simple as it looks this should do it for you:
$ awk '{ORS=( (c+=gsub(/"/,"&"))%2 ? FS : RS )} 1' file
id,name
01,"this is with newline"
02,no newline
If you are using this xlsx2csv tool, it has this option:
-e, --escape Escape \r\n\t characters
Use it, and then replace \n as needed, like (if \n should be replaced by the empty string):
sed 's/\\n//g' filein.csv > fileout.csv
In one pass:
PATH/TO/xlsx2csv.py -e filein.xlsx | sed 's/\\n//g' > fileout.csv
How to do it with regex or with other terminal tools generically without knowing number of columns in advance?
I don't think a regex is the most appropriate approach and might end up being quite complicated. Instead, I think a separate program to process the files might be easier to maintain in the long-term.
Since you're OK with any terminal tools, I've chosen python, and the code's below:
#!/usr/bin/python3 -B
import csv
import sys
with open(sys.argv[1]) as csvfile:
reader = csv.reader(csvfile)
for row in reader:
stripped = [col.replace('\n', ' ') for col in row]
print(','.join(stripped))
I think the code above is very straightforward and easy to understand, without a need for complicated regular expressions.
The input file here has the following contents:
id,name
01,"this is
with newline"
02,no newline
To prove it works, its output is reproduced below:
➜ ~ ./test.py input.csv
id,name
01,this is with newline
02,no newline
You could call the python script from some other program and feed filenames to it. You just need to add a minor update for the python program to write out files, if that's what you really need.
I've replaced the newlines with spaces to avoid a potentially unwanted concatenation (e.g. this iswith newline), but you can replace the newline with whatever you want, including the empty string ''.
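As a rough sketch of that "minor update", the variant below writes through csv.writer instead of ','.join, so cells that contain commas or quotes get re-quoted on output. The function name flatten_newlines and the replacement parameter are made up for illustration; only the csv and io standard modules are assumed.

```python
import csv
import io


def flatten_newlines(src, dst, replacement=" "):
    """Copy CSV rows from src to dst, replacing newlines inside cells."""
    reader = csv.reader(src)
    # csv.writer adds quotes back only where a cell needs them (commas, quotes).
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        cleaned = [
            cell.replace("\r\n", replacement).replace("\n", replacement)
            for cell in row
        ]
        writer.writerow(cleaned)


if __name__ == "__main__":
    # newline="" keeps embedded line breaks untouched for the csv module.
    sample = io.StringIO(
        'id,name\n01,"this is\nwith newline"\n02,"a, b"\n', newline=""
    )
    out = io.StringIO()
    flatten_newlines(sample, out)
    print(out.getvalue(), end="")
```

For real files, opening both with open(path, newline="") should behave the same way as the in-memory streams used here.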
I have written a method to remove the embedded new line inside the cell. The method below returns a java.util.List object that contains all rows in the CSV file
List<String> getAllRowsInCSVFileAsList(File selectedCSVFile){
FileReader fileReader = null;
BufferedReader reader = null;
List<String> values = new ArrayList<String>();
try{
fileReader = new FileReader(selectedCSVFile);
reader = new BufferedReader(fileReader);
String line = reader.readLine();
String previousLine = "";
//
boolean intendLineInCell = false;
while(line != null){
if(intendLineInCell){
if(line.indexOf("\"") != -1 && line.indexOf("\"") == line.lastIndexOf("\"")){
previousLine += line;
values.add(previousLine);
previousLine = "";
intendLineInCell = false;
} else if(line.indexOf("\"") != -1 && line.indexOf("\"") != line.lastIndexOf("\"")){
if(getTotalNumberOfCharacterSequenceOccurrenceInString("\"", line) % 2 == 0){
previousLine += line;
}else{
previousLine += line;
values.add(previousLine);
previousLine = "";
intendLineInCell = false;
}
} else{
previousLine += line;
}
}else{
if(line.indexOf("\"") == -1){
values.add(line);
}else if ((line.indexOf("\"") == line.lastIndexOf("\"")) && line.indexOf("\"") != -1){
intendLineInCell = true;
previousLine = line;
}else if(line.indexOf("\"") != line.lastIndexOf("\"") && line.indexOf("\"") != -1){
values.add(line);
}
}
line = reader.readLine();
}
}catch(IOException ie){
ie.printStackTrace();
}finally{
if(fileReader != null){
try {
fileReader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
if(reader != null){
try {
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
return values;
}
int getTotalNumberOfCharacterSequenceOccurrenceInString(String characterSequence, String text){
int count = 0;
while(text.indexOf(characterSequence) != -1){
text = text.replaceFirst(characterSequence, "");
count++;
}
return count;
}
Imagine you are creating a CSV file with one row and five columns, and the 4th cell contains an embedded newline (Enter pressed inside the cell).
Your data will look like this (there is actually only one row in the CSV, but opened in Notepad it looks like two rows):
dinesh,kumar,24,"23
tambaram india",green
A cell with an Enter inside it looks like this:
"23
tambaram india"
That cell starts with a double quote (") and ends with a double quote (").
So while reading a line, an unclosed double quote (") tells us that the cell contains an embedded Enter.
The code concatenates the next line onto that line and checks whether the closing double quote (") is there. If it is, it adds a new row to the java.util.List object; otherwise it concatenates the following line, checks again for the closing double quote ("), and so on. I have explained this for one cell, but the method also works if the row has many cells with embedded Enters.
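The quote-counting idea described above can be condensed quite a bit. The following is only a sketch in Python, not a translation of the Java method: it keeps appending physical lines until the number of double quotes collected so far is even. The name join_broken_rows and the joiner parameter are hypothetical, and unlike the Java code it inserts a space where the line break was.

```python
def join_broken_rows(lines, joiner=" "):
    """Merge physical lines until the double quotes seen so far are balanced."""
    rows = []
    pending = ""
    for line in lines:
        line = line.rstrip("\r\n")
        pending = line if not pending else pending + joiner + line
        # An odd number of quotes means a quoted cell is still open.
        # Escaped quotes ("") add two, so they do not disturb the parity.
        if pending.count('"') % 2 == 0:
            rows.append(pending)
            pending = ""
    if pending:
        # Unterminated quote at end of input: keep what was collected.
        rows.append(pending)
    return rows


if __name__ == "__main__":
    sample = ['dinesh,kumar,24,"23\n', 'tambaram india",green\n']
    for row in join_broken_rows(sample):
        print(row)
```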
Open the CSV file in Notepad++ and press Ctrl+H. On the Replace tab, enter the newline in the "Find what" box (with the Extended search mode, \n or \r\n), then type the replacement text in "Replace with", or leave it empty if you just want the newlines removed.

Notepad++ or UltraEdit: regex remove special duplicates

I need to remove duplicates if
key = anything
but NOT
key=anything
the key can be anything too
e.g.
edit_home=home must be in place
while
edit_home = home or even other string must be removed IF edit_home is a duplicate
for all the lines of the document
thank you
p.s. clearer example:
one=you are
two=we are
three_why=8908908
one = good
two = fine
three_4 = best
three_why = win
from that list i only need to keep:
one=you are
two=we are
three_why=8908908
three_4 = best // because three_4 doesn't have a duplicate
I found a method to do it, but I would need a better search list support by regex or a plugin or a direct regex (which I don't know).
That is: I have two files to compare.
One has the full keys, the other has incomplete.
I merge in a new file all the keys from the first file with those ones of the second, in groups (because the keys are in groups e.g. many keys titled one, many titled two and so on...). Then I regex replace all the keys in the new file by
find (.*)(\s\=\s) replace with \1\=
So they all become key=anything
Then I replace everything after = with empty to isolate the keys.
Then remove the duplicates.
At this point I have trouble to do something like
^.*(^keyone\b|^keytwo\b|^keythree\b).*$
to find all those keys in the document I need. So from that I can select all and replace with the correct keys.
Why? Because in this example the keys are 3 only BUT indeed the keys are many and the find field breaks at a certain point.
How to do it right?
Update: I found Toolbucket plugin which allows to search for many strings, but another issue is that in addition to duplicate, I also have to remove the original.
That is, if I find 2 times the same key "one" I have to remove all the lines containing one.
Ctrl + F
Find tab
Find what: ^.*\S=\S.*$
Find All in Current Document
Copy result from result window to a new window (the list of Line 1: Line 2: Line 3: ...)
Ctrl + F
Replace tab
(the following will remove the leading "Line number:" from every line)
Find what: ^.*?\d:\s
Replace with: Empty
ok, after all that i wrote, one solution could be (therefore, once i have the merged keys)
(?m)^(.*)$(?=\r?\n^(?!\1).*(?s).*?\1)
with this i can mark/highlight all the duplicated keys :-) so then i can manage those only, removing them from the first list and adding what remains to the second file...
If someone has a solution with a direct regex will be really appreciated
Here is a commented UltraEdit script for this task.
// Note: This script does not work for large files as it loads the
// entire file content into very limited scripting memory for fast
// processing even with multiple GB of RAM installed.
if (UltraEdit.document.length > 0) // Is any file opened?
{
// Define environment for this script and select entire file content.
UltraEdit.insertMode();
UltraEdit.columnModeOff();
UltraEdit.activeDocument.selectAll();
// Determine line termination used currently in active file.
var sLineTerm = "\r\n";
if (typeof(UltraEdit.activeDocument.lineTerminator) == "number")
{
// The two lines below require UE v16.00 or UES v10.00 or later.
if (UltraEdit.activeDocument.lineTerminator == 1) sLineTerm = "\n";
else if (UltraEdit.activeDocument.lineTerminator == 2) sLineTerm = "\r";
}
else // This version of UE/UES does not offer line terminator property.
{
if (UltraEdit.activeDocument.selection.indexOf(sLineTerm) < 0)
{
sLineTerm = "\n"; // Not DOS, perhaps UNIX.
if (UltraEdit.activeDocument.selection.indexOf(sLineTerm) < 0)
{
sLineTerm = "\r"; // Also not UNIX, perhaps MAC.
if (UltraEdit.activeDocument.selection.indexOf(sLineTerm) < 0)
{
sLineTerm = "\r\n"; // No line terminator, use DOS.
}
}
}
}
// Get all lines of active file into an array of strings
// with each string being one line from active file.
var asLines = UltraEdit.activeDocument.selection.split(sLineTerm);
var nTotalLines = asLines.length;
// Process each line in the array.
for(var nCurrentLine = 0; nCurrentLine < asLines.length; nCurrentLine++)
{
// Skip all lines not containing or starting with an equal sign.
if (asLines[nCurrentLine].indexOf('=') < 1) continue;
// Get string left to equal sign with tabs/spaces trimmed.
var sKey = asLines[nCurrentLine].replace(/^[\t ]*([^\t =]+).*$/,"$1");
// Skip lines beginning with just tabs/spaces left to equal sign.
if (sKey.length == asLines[nCurrentLine].length) continue;
// Build the regular expression for the search in all other lines.
var rRegSearch = new RegExp("^[\\t ]*"+sKey+"[\\t ]*=","g");
// Check all remaining lines for a line also starting with
// this key string (case-sensitive) left to an equal sign.
var nLineCompare = nCurrentLine + 1;
while(nLineCompare < asLines.length)
{
// Does this line also have this key left to equal
// sign with or without surrounding spaces/tabs?
if (asLines[nLineCompare].search(rRegSearch) < 0)
{
nLineCompare++; // No, continue on next line.
}
else // Yes, remove this line from array.
{
asLines.splice(nLineCompare,1);
}
}
}
// Was any line removed from the array?
if (nTotalLines == asLines.length)
{
UltraEdit.activeDocument.top(); // Cancel the selection.
UltraEdit.messageBox("Nothing found to remove!");
}
else
{
// If version of UE/UES supports direct write to clipboard, use
// user clipboard 9 to paste the lines into file with overwriting
// everything as this is much faster than using write command in
// older versions of UE/UES.
if (typeof(UltraEdit.clipboardContent) == "string")
{
var nActiveClipboard = UltraEdit.clipboardIdx;
UltraEdit.selectClipboard(9);
UltraEdit.clipboardContent = asLines.join(sLineTerm);
UltraEdit.activeDocument.paste();
UltraEdit.clearClipboard();
UltraEdit.selectClipboard(nActiveClipboard);
}
else UltraEdit.activeDocument.write(asLines.join(sLineTerm));
var nRemoved = nTotalLines - asLines.length;
UltraEdit.activeDocument.top();
UltraEdit.messageBox("Removed " + nRemoved + " line" + ((nRemoved != 1) ? "s" : "") + " on updated file.");
}
}
Copy this code and paste it into a new ASCII file using DOS line terminators in UltraEdit.
Next use command File - Save As to save the script file for example with name RemoveDuplicateKeys.js into %AppData%\IDMComp\UltraEdit\MyScripts or wherever you want to have saved your UltraEdit scripts.
Open Scripting - Scripts and add the just saved UltraEdit script to the list of scripts. You can enter a description for this script, too.
Open the file with the list, or make this file active if it is already opened in UltraEdit.
Run the script by clicking on it in menu Scripting, or by opening Views - Views/Lists - Script List and double clicking on the script.
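Outside of an editor, the rule from the question (drop key = anything whenever the same key also appears as key=anything, keep everything else) can be sketched in a few lines of Python. This is an illustration under assumptions, not a port of the UltraEdit script: the function name drop_spaced_duplicates is made up, keys are assumed to contain no whitespace or equal sign, and "spaced" means at least one space or tab before the equal sign.

```python
import re

TIGHT = re.compile(r"^([^\s=]+)=")          # key=anything
SPACED = re.compile(r"^([^\s=]+)[ \t]+=")   # key = anything


def drop_spaced_duplicates(lines):
    """Remove 'key = value' lines whose key also occurs as 'key=value'."""
    tight_keys = set()
    for line in lines:
        m = TIGHT.match(line)
        if m:
            tight_keys.add(m.group(1))
    kept = []
    for line in lines:
        m = SPACED.match(line)
        if m and m.group(1) in tight_keys:
            continue  # duplicate of a key=anything line, drop it
        kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "one=you are", "two=we are", "three_why=8908908",
        "one = good", "two = fine", "three_4 = best", "three_why = win",
    ]
    print("\n".join(drop_spaced_duplicates(sample)))
```

Run on the example list from the question, this keeps the three key=anything lines plus three_4 = best.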

validating HTML fields in form using regex using perl

I have a couple of quick questions regarding using regex to validate some fields in a form. But I seem to be having some problems.
so here is the code
$userNameReg = "[a-zA-Z0-9_]+";
$passwordReg = "([a-zA-Z]*)([A-Z]+)([0-9]+)";
$emailReg = "[a-zA-Z0-9_]@[a-zA-Z]\.[a-zA-Z]{2,3}";
if ($onLoad !=1)
{
@controlValue = ($userName, $password, $phoneNumber, $email);
@regex = ($userNameReg, $passwordReg, "phoneNumber", $emailReg);
@validated;
for ($i=0; $i<4; $i++)
{
$retVal= validatecontrols ($controlValue[$i], $regex[$i]);
if ($retVal)
{
$count++;
}
if (!$retVal)
{
$validated[$i]="*"
}
}
sub validatecontrols
{
$ctrlVal = shift();
$regexVal = shift();
if ($ctrlVal =~ /$regexVal/)
{
return 1;
}
if ($ctrlVal !~ /$regexVal/)
{
return 0;
}
}
}
So what happens is that it still validates special characters, and I can't understand why. It does throw a flag if I enter a single special character, but if it's part of a word (in the middle, at the beginning or at the end) it validates.
Also please disregard the phone number part, because I haven't gotten to that part yet. I still have to create a regex that validates the phone number, digits only, first digit greater than 2.
Thank you all in advance for your help and insight.
Cheers
My guess is that you're missing start/end anchors. So [a-zA-Z0-9_]+ should be ^[a-zA-Z0-9_]+$. This way the pattern will only match the full string.
I also strongly recommend enabling use strict;. It can save you from a lot of typing mistakes. Just add the following to the beginning of the script:
use strict;
use warnings;
This will force perl to only allow declared variables. In most cases you'll need to add my to the first use of each variable (for example my $ctrlVal).
In validatecontrols you don't need the second if statement. You can just return false like this:
sub validatecontrols
{
my $ctrlVal = shift();
my $regexVal = shift();
if ($ctrlVal =~ /$regexVal/)
{
return 1;
}
return 0;
}
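To see why the anchors matter, here is a small sketch of the same check in Python rather than Perl (the regex syntax used is the same in both). The helper names matches_anywhere and matches_whole are made up for the demonstration; the pattern is the user name pattern from the question.

```python
import re

USER_NAME = r"[a-zA-Z0-9_]+"


def matches_anywhere(value):
    """Unanchored: succeeds if any substring fits the pattern."""
    return re.search(USER_NAME, value) is not None


def matches_whole(value):
    """Anchored with ^ and $: the whole value has to fit the pattern."""
    return re.search("^" + USER_NAME + "$", value) is not None


if __name__ == "__main__":
    for value in ["abc_1", "ab$cd", "$"]:
        print(repr(value), matches_anywhere(value), matches_whole(value))
```

The unanchored version accepts ab$cd because ab alone already satisfies the pattern, and it rejects a lone $ because nothing in it matches. That lines up with the behaviour described in the question.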

Why isn't my perl script finding bad indentation from my regex match

My work's coding standard uses this bracket indentation:
some declaration
{
stuff = other stuff;
};
control structure, function, etc()
{
more stuff;
for(some amount of time)
{
do something;
}
more and more stuff;
}
I'm writing a perl script to detect incorrect indentation. Here's what I have in the body of a while(<some-file-handle>):
# $prev holds the previous line in the file
# $current holds the current in the file
if($prev =~ /^(\t*)[^;]+$/ and $current =~ /^(?<=!$1\t)[\{\}].+$/) {
print "$file @ line ${.}: Bracket indentation incorrect\n";
}
Here, I'm trying to match:
$prev: A line not ended with a semi-colon, followed by...
$current: A line not having the number of leading tabs+1 of the previous line.
This doesn't seem to match anything, at the moment.
The $prev pattern needs some modification.
It should be something like \t*, then .+, then not ending in a semicolon.
Also, $current should be something like:
anything ending in ; or { or } that does not have the number of leading tabs+1 of the previous line.
EDIT
the perl code to try the $prev
#!/usr/bin/perl -l
open(FP,"example.cpp");
while(<FP>)
{
if($_ =~ /^(\t*)[^;]+$/) {
print "got the line: $_";
}
}
close(FP);
//example.cpp
for(int i = 0;i<10;i++)
{
//not this;
//but this
}
//output
got the line: {
got the line: //but this
got the line: }
it did not detect the line with the for loop ...
am i missing something...
I see a couple of problems...
Your $prev regex matches every line that does not have a ; anywhere, which breaks on lines like for (int x = 1; x < 10; x++).
If the indent of the opening { is incorrect, you will not detect that.
Try this instead; it only looks at whether the line ends in ; or { (ignoring trailing whitespace) and matches when it does not:
/^(\s*).*[^{;]\s*$/
Now you should change your strategy so that if you see a line which does not end in { or ; you increment the indent counter.
If you see a line which ends in }; or } you decrement your indent counter.
Compare all lines against this:
/^\t{$counter}[^\s]/
so...
$counter = 0;
if (!($curr =~ /^\t{$counter}[^\s]/)) {
# error detected
}
if ($curr =~ /\};?\s*$/) {
$counter--;
} elsif ($curr =~ /^(\s*).*[^{;]\s*$/) {
$counter++;
}
sorry for not styling my code according to your standards... :)
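For comparison, here is a rough sketch of a counter-based check in Python. Note that it swaps the semicolon heuristic above for a plain brace-depth counter, and it assumes the coding standard from the question: { and } sit on their own lines at the indentation of the enclosing statement, and everything between them is one tab deeper. The function name check_brace_indentation is made up, and braces inside strings or in the middle of a line are simply ignored.

```python
def check_brace_indentation(lines):
    """Return 1-based numbers of lines whose leading tabs differ from the brace depth."""
    depth = 0
    bad = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        stripped = line.lstrip("\t")
        if not stripped.strip():
            continue  # blank lines carry no indentation information
        if stripped.startswith("}"):
            depth = max(depth - 1, 0)  # a closing brace sits at the outer level
        tabs = len(line) - len(stripped)
        # Leading spaces after the tabs are treated as wrong indentation too.
        if tabs != depth or stripped[0] == " ":
            bad.append(number)
        if stripped.startswith("{"):
            depth += 1  # lines after an opening brace are one level deeper
    return bad


if __name__ == "__main__":
    sample = ["for(some amount of time)\n", "\t{\n", "\tdo something;\n", "}\n"]
    print(check_brace_indentation(sample))
```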
And you intend to only count tabs (not spaces) for indentation?
Writing this kind of checker is complicated. Just think about all the possible constructs that use braces but should not change indentation:
s{some}{thing}g
qw{ a b c }
grep { defined } @a
print "This is just a { provided to confuse";
print <<END;
This {
$is = not $code
}
END
But anyway, if the issues above aren't important to you, consider whether the semi colon is important at all in your regex. After all, writing
while($ok)
{
sort { some_op($_) }
grep { check($_) }
my_func(
map { $_->[0] } @list
);
}
Should be possible.
Have you considered looking at Perltidy?
Perltidy is a Perl script that reformats Perl code into set standards. Granted, what you have isn't part of the Perl standard, but you can probably tweak the curly braces via the configuration file Perltidy uses. If all else fails, you can hack through the code. After all, Perltidy is just a Perl script.
I haven't really used it, but it might be worth looking into. Your problem is trying to locate all the various edge cases, and making sure you're handling them correctly. You can parse 100 programs only to find that the 101st reveals problems in your formatter. Perltidy has been used by thousands of people on millions of lines of code. If there is an issue, it has probably already been found.

Match beginning of file to string literal

I'm working with a multi line text block where I need to divide everything into 3 groups
1: beginning of the file up to a string literal // don't keep
2: The next line //KEEP THE LINE FOLLOWING STRING LITERAL
3: Everything following that line to the end of file. // don't keep
<<
aFirstLing here
aSecondLine here
MyStringLiteral //marks the next line as the target to keep
What I want to Keep!
all kinds of crap that I don't
<<
I'm finding plenty of ways to pull from the beginning of a line but am unable to see how to include an unknown number of non-blank lines until I reach that string literal.
EDIT: I'm removing the .net-ness to focus on regex only. Perhaps this is a place for understanding backreferences?
Rather than read the entire file into memory, just read what you need:
List<string> TopLines = new List<string>();
string prevLine = string.Empty;
foreach (var line in File.ReadLines(filename))
{
TopLines.Add(line);
if (prevLine == Literal)
{
break;
}
prevLine = line;
}
I suppose there's a LINQ solution, although I don't know what it is.
EDIT:
If you already have the text of the email in your application (as a string), you have to split it into lines first. You can do that with String.Split, splitting on newlines, or you can create a StringReader and read it line-by-line. The logic above still applies, but rather than File.ReadLines, just use foreach on the array of lines.
EDIT 2:
The following LINQ might do it:
TopLines = File.ReadLines(filename).TakeWhile(s => s != Literal).ToList();
TopLines.Add(Literal);
Or, if the strings are already in a list:
TopLines = lines.TakeWhile(s => s != Literal).ToList();
TopLines.Add(Literal);
.*(^MyStringLiteral\r?\n)([\w|\s][^\r\n]+)(.+) seems to work. the trick wasn't back references - it was the exclusion of \r\n.
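For what it's worth, the same idea (stop the capture at \r or \n) can be tried out quickly in Python. This is just a sketch with made-up names (line_after, literal), and it assumes the literal sits alone on its line, possibly followed by trailing spaces or tabs.

```python
import re


def line_after(text, literal):
    """Return the line that follows a line consisting of `literal`, or None."""
    pattern = re.compile(
        r"^" + re.escape(literal) + r"[ \t]*\r?\n([^\r\n]*)",
        re.MULTILINE,  # let ^ match at the start of every line
    )
    m = pattern.search(text)
    return m.group(1) if m else None


if __name__ == "__main__":
    text = (
        "aFirstLine here\n"
        "aSecondLine here\n"
        "MyStringLiteral\n"
        "What I want to Keep!\n"
        "all kinds of other stuff\n"
    )
    print(line_after(text, "MyStringLiteral"))
```

Only the group after the literal is captured here; the text before and after it is skipped by the search instead of being matched into groups of its own.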
File.ReadAllLines() will give you an array you can iterate over until you find your literal, then take the next line
string[] lines = File.ReadAllLines(filename);
// Stop one short of the end so lines[i + 1] always exists.
for (int i = 0; i < lines.Length - 1; i++)
{
if (lines[i] == Literal)
return lines[i + 1];
}