First, a sample of the actual data, mangled (the data is originally a mix of text and numbers; there's no significance to any of it at this point, and some of the patterns exist only because I replaced most of the characters with 0s, 1s and Zs, since the random number generator in my brain is broken):
011.0ZN1ZZ 001.F5ZS1Z 001.ZO5ZY0
014.5ZZZ1Z 001.1SZZOZ 001.ZLMZY0
016.01NM1SU54 001.EX0Z1Z 001.LIZZOZ
018.01NM1SS41 001.F83Z1Z 001.0011M1SU54
014.ZZ1YZZ 001.ZZZ1IZ 001.0011M1SS41
013.2EBSIZ 001.ZZZ11Z 001.0011SE4
01N.ZINSIZ 001.ZZZZ1Z P01.ZZZZ1Z
01N.01NSE4 001.LSZZHG N01.ZZZZ1Z
001.01ON5O 001.5Z21OL F01.ZZZZ1Z
001.NE5ZO1 001.ZOM05O D01.ZZZZ1Z
001.ZO5ZOZ 001.01NO1G Z01.ZZZZ1Z
001.ZO5ZOZ 001.01NO1G Z01.ZZZZ1Z
001.011ZOZ 001.01NZ0Y
Some additional comments: I can clean up whitespace and deal with record length with no issues, so I'd like to simplify the question to the following. I'm just including the above in case there's a solution to the simplified version that can't easily be extended to the more complex one.
1 7 13
2 8 14
3 9 15
4 10 16
5 11 17
6 12 18
19 25
20 26
21 27
22 28
23 29
24
So there will be a variable number of pages, but the same number of columns and rows on each page (although, in case it matters significantly, it's actually 12x3 instead of 6x3; I wanted to keep it simple if possible), and the last page may have some empty rows/columns.
I'm using Notepad++, but I have access to various GNU utilities, so if there's a solution that's way, way better than a regular expression I don't mind; still, since I'll be using this a lot and use Notepad++ a lot, I'd appreciate a regex solution if it isn't too insane.
If you've got Git installed on your Windows machine, you may use the Perl bundled with it from Git bash. Provided your input file is named data, try the following command (caution: it will overwrite the input file):
echo >>data ; \
perl -i -lane'
$i=0;
push @{$c[$i++]}, $_ foreach @F;
if (/^\s*$/) {
push @l, @{$_} foreach @c;
print "@l\015";
@l=@c=();
}' data
The Perl command treats each line of input as space-delimited fields and accumulates the fields in the @c matrix. When it encounters an empty line (the if (/^\s*$/) ... test), it prints the matrix columns concatenated into a single list.
The input file is changed in place; use -i.bak instead of -i if you want a backup copy data.bak to be kept.
The input file may not end with an empty line, so I append one with echo >>data. This keeps the Perl script shorter and simpler.
Another trick is the trailing \015 in print "@l\015";. This lets us get Windows CRLF line endings in the Unix-flavoured Git bash environment.
A demo can be found here: https://ideone.com/vnYoOd. But since Ideone forbids file read/write, the original command has been modified to make the code run there.
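If the condensed one-liner is hard to follow, here is the same logic as a commented standalone sketch (my expansion, not the original answer's code; it assumes whitespace-separated fields, the same number of fields on every row of a page, and a blank line ending each page, and it prints one line per page in column-major order):

#!/usr/bin/perl
# transpose_pages.pl - print each page's columns as a single line
use strict;
use warnings;

my @cols;                                      # $cols[$i] holds column $i's values

sub flush_page {
    return unless @cols;
    print join(' ', map { @$_ } @cols), "\n";  # column 1's run, then column 2's, ...
    @cols = ();
}

while (my $line = <>) {
    chomp $line;
    if ($line =~ /^\s*$/) {                    # blank line marks the end of a page
        flush_page();
        next;
    }
    my @fields = split ' ', $line;
    push @{ $cols[$_] }, $fields[$_] for 0 .. $#fields;
}
flush_page();                                  # the last page may lack a blank line

Run it as perl transpose_pages.pl data > out.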
Friends:
I have to process a CSV file using Perl and produce an Excel file as output, using the Excel::Writer::XLSX module. This is not homework but a real-life problem, where I cannot download whatever Perl version I want (I actually need to use Perl 5.6) or whatever Perl modules I want (I have a limited set of them). My OS is UNIX. I can also use (embedding in Perl) ksh and csh (with some limitations, as I have found so far). Please limit your answers to the tools I have available. Thanks in advance!
Even though I am not a Perl developer and come from other languages, I have already done my work. However, the customer is asking for extra processing, and that is where I am getting stuck.
1) The stones in the road come from two sides: from Perl's and from Excel's particular styles of processing data. I have already found a workaround to handle the Excel side, but, as mentioned in the subject, I am having difficulties processing the zeroes found in the CSV input file. To handle the Excel side, I am using the '0 approach, which is the final form of data representation that Excel seems to use with the # formatting style.
2) Scenario:
I need to catch standalone zeroes which might be present in any line / column / cell of the CSV input file and put them as such (as zeroes) in the Excel output file.
I will go directly to the point of my question to avoid losing your valuable time; I am providing more details after the question.
Research and question:
I tried to use a Perl regex to find standalone "0"s and replace them with an arbitrary string, planning to replace them back with "0" at the end of processing.
perl -p -i -e 's/\b0\b/string/g' myfile.csv
and
perl -i -ple 's/\b0\b/string/g' myfile.csv
Both are working, but only from the command line. They aren't working when I call them from the Perl script as follows:
system("perl -i -ple 's/\b0\b/string/g' myfile.csv")
I do not know why... I have already tried using exec and eval instead of system, with the same results.
Note that I have a ton of regexes that work perfectly with the same structure, such as the following:
system("perl -i -ple 's/input/output/g' myfile.csv")
I have also tried using backticks and qx//, without success. Note that qx// and backticks do not have the same behavior, since qx// complains about the \b boundaries because of the forward slash.
I have tried using sed -i, but my system rejects -i as an invalid flag (I do not know if this happens on every UNIX, but at least it happens on the one at work; it does accept perl -i, however).
I have tried embedding awk (which works from the command line), in this way:
system `awk -F ',' -v OFS=',' '$1 == \"0\" { $1 = "string" }1' myfile.csv > myfile_copy.csv`
But this works only for the first column (from the command line) and, besides the disadvantage of the extra copy file, Perl complains about the > redirection, taking it as "greater than"...
system(q#awk 'BEGIN{FS=OFS=",";split("1 2 3 4 5",A," ") } { for(i in A)sub(0,"string",$A[i] ) }1' myfile.csv#);
This awk works from the command line, but only on 5 columns; and it does not work in Perl using the # delimiter.
All the combinations of exec and eval have also been tested, without success.
I have also tried passing each of the awk components to system as separate arguments, but I did not find any valid way to pass the redirector (>), since Perl rejects it for the reason mentioned above.
Using another approach, I noticed that the "standalone zeroes" seem to be "swallowed" by the Text::CSV module, so I got rid of it and went back to a traditional line-by-line loop over the CSV with a split on commas, which preserves the zeroes. However, I then found the "mystery" of isdual in Perl, and because of the limited set of modules I have, I cannot use Dumper. I also explored the guts of Perl's internals and tried the $x ^ $x test, which was deprecated since version 5.22 but is valid up to that version (I said mine is 5.6). This is useful to tell numbers from strings. However, while if( $x ^ $x ) returns TRUE for strings, if( !( $x ^ $x ) ) does not return TRUE when $x = 0. [UPDATE: I tried this in a dedicated Perl script, written just for this purpose, and it is working. I believe my probably wrong conclusion ("not returning TRUE") was reached before I realized that Text::CSV was swallowing my zeroes. Doing new tests...]
I will very much appreciate your help!
MORE DETAILS ON MY REQUIREMENTS:
1) This is a dynamic report coming from a database, which is handed over to me and which I pick up programmatically from a folder. Dynamic means that it might have any number of tables, any number of columns in each table, any names as column headers, and any number of rows in each table.
2) I do not know, and cannot know, the column names, because they vary from report to report. So, I cannot be guided by column names.
A sample input:
Alfa,Alfa1,Beta,Gamma,Delta,Delta1,Epsilon,Dseta,Heta,Zeta,Iota,Kappa
0,J5,alfa,0,111.33,124.45,0,0,456.85,234.56,798.43,330000.00
M1,0,X888,ZZ,222.44,111.33,12.24,45.67,0,234.56,0,975.33
3) Input Explanation
a) This is an example of a random report with 12 columns and 3 rows. The first row is the header.
b) I call "standalone zeroes" those "clean" zeroes which come in the CSV file, from the second row onwards, between commas, like 0, (if it is the first position in the row) or like ,0, in subsequent positions.
c) In the second row of the example you can read, from the beginning of the row: 0,J5,alfa,0, which in this particular case are "words" or "strings". In this case, 4 names (note that two of them are zeroes, which need to be treated as strings). Thus we have a 4-name-columns example (Alfa,Alfa1,Beta,Gamma are the headers for those columns, but only in this scenario). From that point onwards in the second row you can see floating-point (*.00) numbers and, among them, 2 zeroes, which are numbers. Finally, in the third row, you can read M1,0,X888,ZZ, which are the names for the first 4 columns. Note, please, that the 4th column in the second row has 0 as its name, while the 4th column in the third row has ZZ as its name.
Summary: as a general picture, I have a table-report divided into 2 parts, from left to right: 4 columns for names, and 8 columns for numbers.
Always the first M columns are names and the last N columns are numbers.
- It is unknown what M is: how many columns devoted to words / strings I will receive.
- It is unknown what N is: how many columns devoted to numbers I will receive.
- It is KNOWN that, after the M name columns end, the N number columns always start, and this is constant for all the rows.
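Given those constraints, here is a minimal sketch (my illustration, not part of the original question) of the hand-rolled split approach mentioned above, which preserves standalone zeroes by comparing each cell exactly; it assumes plain CSV with no quoted or embedded commas, as in the sample, and reuses the 'string' placeholder from the attempts above:

#!/usr/bin/perl
# Sketch: mark standalone zeroes while splitting each CSV line by hand.
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    my @cells = split /,/, $line, -1;       # -1 keeps trailing empty fields
    for my $cell (@cells) {
        $cell = 'string' if $cell eq '0';   # exact match: 0.00 and 330000.00 stay intact
    }
    print join(',', @cells), "\n";
}

The eq '0' test only fires on a cell that is exactly one zero, so it cannot touch leading zeroes or decimal values.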
I have done some quick research on Perl regex boundaries ( \b ), and I have not found any relevant information about whether they apply in Perl 5.6 or not.
However, since you are using an old Perl version, try the traditional UNIX / Linux style (I mean, what Perl inherits from the shell), like this:
system("perl -i -ple 's/^0/string/g' myfile.csv");
The previous regex should do the job, making the change at the start of each line in your CSV file when it matches.
Or, maybe better (if you have those "standalone" zeroes and want to avoid any unwanted change to some "leading zeroes" string):
system("perl -i -ple 's/^0,/string,/g' myfile.csv");
[Note that I have added the comma after the zero; and, of course, after the string.]
Note that the first regex should work; the second one is just a safeguard, to be cautious.
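A hedged side note on why the system() calls may fail where the command line works: inside a Perl double-quoted string, \b is interpolated as a backspace character before the shell ever sees it, so system("perl -i -ple 's/\b0\b/string/g' myfile.csv") hands the child perl a pattern containing literal backspaces rather than word boundaries. One way to sidestep the quoting entirely is to do the in-place edit inside the main script; a minimal sketch using only Perl 5.6-era features (the file name and 'string' placeholder are taken from the question):

#!/usr/bin/perl
# Sketch: in-place replacement of standalone zeroes, with no system() call.
use strict;
use warnings;

@ARGV = ('myfile.csv');    # file(s) for the <> operator to edit
$^I   = '.bak';            # enable in-place editing, keeping a .bak backup
while (<>) {
    s/\b0\b/string/g;      # a real word boundary here, not a backspace
    print;
}

Be aware that \b0\b also matches the zero in a field like 0.5 or 111.0, since the dot is not a word character; the exact-match split sketch above is stricter.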
I'm trying to write a small script that reads binary data from a file before processing it further, using e.g. regexes to simplify some steps.
During the regex step I'm seeing some weird behavior I just can't figure out. The code basically goes like this (heavily stripped down to just the relevant part):
fh = open(filename,'rb')
bd = fh.read(32) # binary data
xlen = bd[3] # byte that specifies length of a command - this may vary for each 32h byte read
bd_x = bd[4:4+xlen] # pick out interesting part of the data. for the data I see the weird behavior the length of bd_x will always be 7
if re.match(b'\x00((.*?){%d})\x30' % (xlen - 2), bd_x):
    ...  # update some other lists, etc.
I just need to check whether the start of the interesting data is \x00 and the end is \x30, with 5 other bytes in between whose values are irrelevant. The total length, including the start and end bytes I'm trying to match, is thus 7, as mentioned.
In a sample file I have with random data, this works on about 100 of the 130 32-byte chunks, while it should match all 130 and not just 100.
I printed out the content of bd_x for both cases, i.e. for chunks where it worked and chunks where it didn't. Output from print(xlen, hexlify(bd_x)) (n for negative, p for positive):
n 7 b'000000290a0030'
n 7 b'0000002b0a0030'
n 7 b'0000002d0a0030'
n 7 b'0000002f0a0030'
n 7 b'000000310a0030'
n 7 b'000000330a0030'
p 7 b'00000003000030'
p 7 b'00000005000030'
p 7 b'00000000000030'
p 7 b'00000000020030'
As far as I can see, all the samples should have matched the regex. If I change to re.search it matches on all 130 chunks, but to be honest I don't know why re.match isn't working for all chunks of data, as the start always matches \x00 and the rest should match too.
I've manually checked all the entries of the test file that fail in a hex-editor, and I just can't see why it doesn't work on those entries.
I know I can probably just do hexlify(bd_x) and operate on the output of that function instead, but for now I'm interested in figuring out why this doesn't work.
Suggestions / solutions are appreciated.
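One hedged observation on the dump above: every failing chunk contains a 0x0a byte between the leading \x00 and the trailing \x30, and in Python's re module the . metacharacter does not match a newline (byte 0x0a) unless re.DOTALL is given. That would explain both the failures and why re.search still appears to work: with the non-greedy (.*?){5}, search is free to restart the match at the \x00 immediately before the final \x30. A minimal sketch of an adjusted check (the 7-byte layout is taken from the question; re.DOTALL is the only change of substance):

import re

# 7 bytes total: \x00, then five bytes of anything (including 0x0a), then \x30
pattern = re.compile(rb'\x00.{5}\x30', re.DOTALL)

samples = (
    bytes.fromhex('000000290a0030'),  # failed with the original pattern
    bytes.fromhex('00000003000030'),  # matched with the original pattern
)
for sample in samples:
    print(bool(pattern.match(sample)))  # True for both with re.DOTALL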
source txt file:
34|Gurla Mandhata|7694|25243|2788|Nalakankar Himalaya|30°26'19"N 81°17'48"E|Dhaulagiri|1985|6 (4)|China
command input:
:%s/\(\d\+\)\(\d\d\d\)/\1,\2/g
command output:
34|Gurla Mandhata|7,694|25,243|2,788|Nalakankar Himalaya|30°26'19"N 81°17'48"E|Dhaulagiri|1,985|6 (4)|China
Desired output:
34|Gurla Mandhata|7,694|25,243|2,788|Nalakankar Himalaya|30°26'19"N 81°17'48"E|Dhaulagiri|1985|6 (4)|China
Basically, 1985 is supposed to stay 1985 and not become 1,985. I tried to put a \? in so the pattern stops every time it matches, and a °\+ after it so it has to see a ° to match, but with no success: it just replaces the ° and everything before it. A complete mess.
My knowledge of regular expressions combined with :substitute is weak, however, and I'm stuck here.
EDIT
The first 3 numbers represent heights of mountains; those 3 need to change with a (,). The last number (1985) represents a year, which must not be changed.
Mathematical solutions are not going to work as a loophole, since there are mountains with a height of less than 1900.
You haven't told us what the difference is between 1985 and the other numbers, so I assumed that your "small" numbers are less than 2000.
You almost got it:
:%s/\(\d*[2-90]\)\(\d\d\d\)/\1,\2/g
Alternatively, if that isn't what you want, you can use the c flag (:h s_flags):
:%s/\(\d\+\)\(\d\d\d\)/\1,\2/gc
This line will leave the last 3 columns untouched and only do the substitution on the content before them:
%s/\v(.*)((\|[^|]*){3}$)/\=substitute(submatch(1),'\v(\d+)(\d{3})','\1,\2','g').submatch(2)/g
Note that the above line will change 1000000 into 1000,000 instead of 1,000,000. Vim's printf() doesn't support %'d, which is a pity. If you do have numbers > 1m, we can find other solutions.
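One such solution, sketched in Perl (my addition, not from the original answer; it assumes pipe-separated records and that only fields 3-5, the heights, should get separators): repeating the substitution until it no longer fires groups numbers of any size correctly.

#!/usr/bin/perl
# Sketch: add thousands separators to fields 3..5 of pipe-separated lines.
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    my @f = split /\|/, $line, -1;        # -1 keeps trailing empty fields
    for my $i (2 .. 4) {                  # 0-based indices of the height columns
        next unless defined $f[$i];
        1 while $f[$i] =~ s/(\d)(\d{3})(?!\d)/$1,$2/;   # 1000000 -> 1,000,000
    }
    print join('|', @f), "\n";
}

From vim you could filter the buffer through it with :%!perl thousands.pl.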
Update
I solved it myself, using 3 separate commands, one for every number string in the file:
:%s/^\(\d*|[^|]*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
:%s/^\(\d*|[^|]*|\d\+,*\d*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
:%s/^\(\d*|[^|]*|\d\+,*\d*|\d\+,*\d*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
In case you want to use perl:
:%!perl -F'\|' -lane 'for(@F[2..4]) { s/(\d+)(\d{3})/$1,$2/;} print join "|", @F'
I am trying to use csplit in bash to split a file, using years in the 1500s-1600s as delimiters.
When I do the command
csplit Shakespeare.txt '/1[56]../' '{36}'
it almost works, except for at least two issues:
This outputs 38 files, not 36, numbered xx00 through xx37. (Also xx00 is completely blank.) I don't understand how this is possible.
One of the files (which seems to be why csplit returns 37 non-empty files instead of the 36 non-empty files I expected) doesn't begin with 15XX or 16XX -- it begins with "ACT 4 SCENE 15\n" (where \n denotes a newline or line break). I don't understand how csplit can match a newline/line break as part of a number.
When I do the command (which is what I want)
csplit Shakespeare.txt '/1[56][0-9][0-9]/' '{36}'
the terminal returns the error csplit: 1[56][0-9][0-9]: no match, followed by the same list of numbers (the output file sizes) it prints when the first command is executed.
This especially doesn't make sense to me, since grep says otherwise:
grep -c "1[56][0-9][0-9]" Shakespeare.txt
36
grep -c "1[56].." Shakespeare.txt
36
Note: man csplit indicates that I have the BSD version from January 26, 2005. man grep indicates that I have the BSD version from July 28, 2010.
Based on the answer given here by user 'DRL' on 06-20-2008, I decided to try adding the -k option to csplit.
csplit -k Shakespeare.txt '/^1[56][0-9][0-9]/' '{36}'
This returned an error: csplit: ^1[56][0-9][0-9]: no match
However, it still gave (more or less) the desired output: files xx00 through xx36 (not xx37), and each of the non-empty files xx01-xx36 had the expected/desired content. (In particular, no file began with "ACT 4 SCENE 15".)
The man page for csplit says the following about the -k flag:
-k Do not remove output files if an error occurs or a HUP, INT or TERM signal is received.
Honestly I don't quite understand what this means, but I still have the following conjecture about why this solution worked/works:
Conjecture: csplit expects the beginning of the file to match the regex. Thus, since the beginning line of the file did not match ^1[56][0-9][0-9], it threw a tantrum and quit without the -k flag.
Nevertheless, I still don't understand why 1[56][0-9][0-9] reported no match -- maybe for the same reason. And I definitely don't understand why 1[56].. behaved the way it did (i.e. why csplit produced a 37th file not beginning with the pattern).
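For what it's worth, here is a hedged Perl sketch of the splitting behavior expected from /1[56][0-9][0-9]/ (my illustration, not from the original thread): start a new xxNN file before every line containing such a year. It also shows why xx00 can be empty: if the very first line matches, the section before it has no content.

#!/usr/bin/perl
# csplit-like sketch: begin a new output file at every line containing 15xx or 16xx.
use strict;
use warnings;

my $n = 0;
open my $out, '>', 'xx00' or die "xx00: $!";
while (my $line = <>) {
    if ($line =~ /1[56][0-9][0-9]/) {      # a year starts a new section
        close $out;
        my $name = sprintf 'xx%02d', ++$n;
        open $out, '>', $name or die "$name: $!";
    }
    print {$out} $line;
}
close $out;

Run it as perl split_years.pl Shakespeare.txt.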
I have a number of scripts that replace variables separated by whitespace.
e.g.
sed -i 's/old/new/g' filename.conf
But say I have
#NAME   Weight    Age   Name
Boss    160.000   43    BOB
Data like this is more readable if it stays within the current alignment, so to speak. So if I'm writing a new double, I'd like to overwrite only the width of each of the fields.
My questions are:
1. How do I capture the patterns between values to preserve spaces?
2. Does sed feature a way to force a shell variable, say ${FOOBAR}, to be a certain width?
3a. If so, how do I define this replacement field width?
3b. If not, what program in Linux is best suited for this truncation, assuming I use a mix of number and string data?
EDIT 1
Let me give a couple more examples.
Let's say my file is:
#informative info on this config var.
VAR1    131     comment     second_comment
#more informative info
VAR2    3.4     13132       yet_another_comment
#FOO    THE VALUE      WARNING
Foo     5.6            donteditthis_comment
#BAR    ANOTHER VALUE  WARNING
Bar     6.5            donteditthis_comment
#Yet another informative comment
VAR3    321
in my bash script I have:
#!/bin/bash
#Vars -- real script will have vals in arrays as
#multiple identically tagged config files will be altered
FOO='Foo'
BAR='Bar'
FOO_VAL_NEW='33.3333'
BAR_VAL_NEW='22.1111'
FILENAME='file.conf'
#Define sed patterns
#These could be inline, but are defined here for readability as they're long.
FOO_MATCH=${FOO}<...whatever special character can be used to capture whitespace...>'[0-9]*.*[0-9]*'
FOO_REPLACE=${FOO}<...whatever special characters to output captured whitespace...>${FOO_VAL_NEW}
BAR_MATCH=${BAR}<...whatever special character can be used to capture whitespace...>'[0-9]*.*[0-9]*'
BAR_REPLACE=${BAR}<...whatever special characters to output captured whitespace...>${BAR_VAL_NEW}
#Do the inline edit ... will be in a loop to handle multiple
#identically tagged config files in full-fledged script.
sed -i "s/${FOO_MATCH}/${FOO_REPLACE}/g" ${FILENAME}
sed -i "s/${BAR_MATCH}/${BAR_REPLACE}/g" ${FILENAME}
My expected output is:
#informative info on this config var.
VAR1    131     comment     second_comment
#more informative info
VAR2    3.4     13132       yet_another_comment
#FOO    THE VALUE      WARNING
Foo     33.3333        donteditthis_comment
#BAR    ANOTHER VALUE  WARNING
Bar     22.1111        donteditthis_comment
#Yet another informative comment
VAR3    321
Currently my script works... but there are a couple of annoyances/dangers.
PROBLEM 1
Currently, to match the tag, I include the exact whitespace characters after it. E.g. for the given example I would define
FOO='Foo '
...as I'm unsure how to capture whitespace characters and then output them in the replace field.
This is nice for me, as I know I'm going to keep the spacing up to the first field the same, to maintain readability. But if one of my users (this is for a public project) writes their own file and writes:
#FOO THE VALUE WARNING
Foo 22.0
Now my script is broken for them. I need to capture the whitespace characters in my match pattern, then output them in my output pattern. That way it will play nicely with my file (optimally spaced for readability), but if someone wants to muck things up and not space things nicely, it will still work for them as well, preserving their current spacing.
PROBLEM 2
Okay so we've read a tag and injected a consistent amount of spaces after it for the replace, based on what we found with a regex in the match.
But now I need to replace fields within the string.
Currently my script does this. However, the result isn't the clean style I show above in my desired output. For the above script, for example, I'd get:
#informative info on this config var.
VAR1    131     comment     second_comment
#more informative info
VAR2    3.4     13132       yet_another_comment
#FOO    THE VALUE      WARNING
Foo 33.3333 donteditthis_comment
#BAR    ANOTHER VALUE  WARNING
Bar 22.1111 donteditthis_comment
#Yet another informative comment
VAR3    321
Well, the values are right, but all that work for readability is ruined... argghhh. Now if I opened the files in emacs and pressed the insert key, I could arrow over to the '3' in the Foo-tagged value, start typing the new value, and get the output file I listed as desired. I want my sed inline edit to do the same thing... (Maybe, as Kent showed, this is possible with column?)
I want it to overwrite only on the trailing end. Further, I want the next field (let's say I do end up editing the warning) to start at the same column it started at in the old file.
Put more simply, I want to do a variant of sed -i "s/${MATCH}/${REPLACE}/g" ${FILENAME} that writes replacement variables into a tagged line, starting at the same column that entry occupies in the CURRENT version of the config file.
This requires both saving the spaces and somehow writing only on the trailing end, padding the output so that the next entry keeps the same starting column when my new value's string is shorter than the old one.
In order to improve upon my current solution, it is crucial both to maintain the column start position of each piece of data in a tagged entry and to be able to match a tag with an arbitrary amount of trailing whitespace (which must be preserved). These are trivial operations in a text editor (see the emacs example above) with the help of the insert key, but more complicated in a script.
This way:
1. I make sure the values can be written no matter how other users space their file.
2. If users (like myself) do bother to match the fields column-wise to the comment above to improve readability, then the script won't mess this up, as it only writes on the trailing side.
Let me know if this is unclear at all.
If this can't be done, or is overly onerous with sed alone, I'd be open to an efficient Perl or Python subscript that my bash script would call, although obviously an inline solution (if concise and understandable) is preferable, if possible.
column may help you; see the example below if it is what you are looking for:
kent$ cat f
#NAME   Weight    Age   Name
Boss    160.000   43    BOB
kent$ sed 's/160.000/7.0/' f|column -t
#NAME  Weight  Age  Name
Boss   7.0     43   BOB
kent$ sed 's/160.000/7.7777777777/' f|column -t
#NAME  Weight        Age  Name
Boss   7.7777777777  43   BOB
Using one of your sample datasets, you can get
$ doit Weight 160 7.555555555555 <<\EOD
#NAME Weight Age Name
Boss 160.000 43 BOB
Me 180 25 JAKE
EOD
#NAME  Weight          Age  Name
Boss   7.555555555555  43   BOB
Me     180             25   JAKE
$
with this function:
$ doit ()
{
awk -v tag=$1 -v old=$2 -v new=$3 '
NR==1 { for (i=1; i<=NF; i++) field[$i]=i }  # read column headers
$field[tag] == old {
$field[tag] = new
}
{print}
' | column -t
}
The useful part is loading the column headers into the field name -> column map. With tag being "Weight", field[tag] evaluates to 2 for this input, so $field[tag] is $2, i.e. the second field: the Weight column.
To answer your questions as asked:
My questions are:
How do I capture the patterns between values to preserve spaces?
Because of what Kent pointed out, it's probably best to regenerate spacing that is correct for the new data. If you really need to preserve the exact input spacing wherever possible, forcing lines with replacement values to have different alignment for some values, I'd say ask that again as a separate "no, really, help me here" question.
Does sed feature a way to force a shell variable, say ${FOOBAR}, to be a certain width?
sed's Turing complete, but that's as close to a feature as it's got for this. Sardonic humor aside, the only correct answer here is "no".
3b. If not, what program in Linux is best suited for this truncation, assuming I use a mix of number and string data?
Kent got that one. I didn't know about column; I get questions answered here that I didn't even know to ask. For the value location and substitution, awk should do you just fine.
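Since the question explicitly allows a small Perl subscript, here is a hedged sketch of the column-preserving overwrite described under PROBLEM 1 and PROBLEM 2 (my own illustration, under these assumptions: a tag is the first whitespace-separated token on a line, the value is the second, and padding fills whatever room the old value and its trailing spaces occupied):

#!/usr/bin/perl
# Sketch: overwrite a tagged value in place, preserving the column layout.
use strict;
use warnings;

my %new = (Foo => '33.3333', Bar => '22.1111');    # tag => replacement value

while (my $line = <>) {
    if ($line =~ /^(\w+)([ \t]+)(\S+)([ \t]*)/ && exists $new{$1}) {
        my ($tag, $ws, $old, $trail) = ($1, $2, $3, $4);
        my $room = length($old) + length($trail);      # columns the old value owned
        my $repl = sprintf '%-*s', $room, $new{$tag};  # left-justify, pad with spaces
        substr($line, length($tag) + length($ws), $room) = $repl;
    }
    print $line;
}

The captured $ws re-emits the user's own spacing untouched, and the sprintf padding keeps the following field in its starting column; if the new value is wider than the room available, the rest of the line is pushed right rather than truncated. Run it as perl overwrite.pl file.conf > file.new, or wire it into the loop in place of the sed calls.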