Multiline regex concatenation

I have a tough one that I am struggling with: a multiline search-and-replace that concatenates lines.
Here is my input text:
//
import tset flash_read, flash_writ;
vector ( $tset , (XMOSI, XMISO, XSCLK, XSTRMSTRT, XSTRMSCLK, XSTRMCKEN, XXTALIN, XXTALCPUEN, XHVREGON, XFDRESET, XGLDATA5, XGLDATA4, XGLDATA3, XGLDATA2, XGLDATA1, XGLDATA0):H, (XSTRMD3, XSTRMD2, XSTRMD1, XSTRMD0, XNSS3, XNSS2, XNSS1, XNSS0):H, XTECLOCK, XRXDATA, XRXENABLE, XTXDATA, XTXENABLE, XNRESET, XTCK, XTMS, XTDI, XTDO, XNTRST)
{
repeat 2
> flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 0 0 X 0; // XNTRST
repeat 9
> flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 1 0 X 1; // Test Logic Reset
> flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 0 0 X 1; // Run Test Idle
repeat 2
> flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 1 0 X 1; // Select IR
My desired output format is this:
//
import tset flash_read, flash_writ;
vector ( $tset , (XMOSI, XMISO, XSCLK, XSTRMSTRT, XSTRMSCLK, XSTRMCKEN, XXTALIN, XXTALCPUEN, XHVREGON, XFDRESET, XGLDATA5, XGLDATA4, XGLDATA3, XGLDATA2, XGLDATA1, XGLDATA0):H, (XSTRMD3, XSTRMD2, XSTRMD1, XSTRMD0, XNSS3, XNSS2, XNSS1, XNSS0):H, XTECLOCK, XRXDATA, XRXENABLE, XTXDATA, XTXENABLE, XNRESET, XTCK, XTMS, XTDI, XTDO, XNTRST)
{
repeat 2 > flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 0 0 X 0; // XNTRST
repeat 9 > flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 1 0 X 1; // Test Logic Reset
> flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 0 0 X 1; // Run Test Idle
repeat 2 > flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 1 0 X 1; // Select IR
I am looking for a Unix one-liner that finds each line containing repeat and replaces the newline after the repeat count with whitespace, so that the repeat line is joined with the line that follows it, padded as shown in the desired output.
Lines without a repeat count only need their start pushed right by the number of spaces shown in the desired output.
Approaches I have tried so far without success:
(1) Sed with usage of branch labels, N, pattern space
(2) AWK with changing the RS
(3) Perl with s/// and multiline flag turned on
Granted, this could be done with nested regex conditionals in a full-fledged Perl or Python script, but I am looking for a more elegant solution.

In perl:
perl -0777 -lne 's/^(repeat[ ]+\d+)\s+/$1\t/mg; s/^[ ]*>/\t\t>/mg; print' file
//
import tset flash_read, flash_writ;
vector ( , (XMOSI, XMISO, XSCLK, XSTRMSTRT, XSTRMSCLK, XSTRMCKEN, XXTALIN, XXTALCPUEN, XHVREGON, XFDRESET, XGLDATA5, XGLDATA4, XGLDATA3, XGLDATA2, XGLDATA1, XGLDATA0):H, (XSTRMD3, XSTRMD2, XSTRMD1, XSTRMD0, XNSS3, XNSS2, XNSS1, XNSS0):H, XTECLOCK, XRXDATA, XRXENABLE, XTXDATA, XTXENABLE, XNRESET, XTCK, XTMS, XTDI, XTDO, XNTRST)
{
repeat 2 > flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 0 0 X 0; // XNTRST
repeat 9 > flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 1 0 X 1; // Test Logic Reset
> flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 0 0 X 1; // Run Test Idle
repeat 2 > flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 1 0 X 1; // Select IR
Or, you can also do:
perl -0777 -lpe 's/^(repeat[ ]+\d+)\s+/$1\t/mg; s/^[ ]*>/\t\t>/mg;' file
You may need to adjust the number of \t in the second substitution, but you get the idea.
Ed's awk is brilliant. You can also do something like that in perl:
perl -lne ' if (/^repeat[\h]+\d+/) {$ll=$_; next}
if (/^\h+>/) {$_=sprintf("%-21s%s",$ll,$_);$ll="";}
print' file

$ awk '
/^repeat/ { pfx = $0; next }
/^ >/ { $0 = sprintf("%-21s%s", pfx, $0); pfx="" }
{ print }
' file
//
import tset flash_read, flash_writ;
vector ( $tset , (XMOSI, XMISO, XSCLK, XSTRMSTRT, XSTRMSCLK, XSTRMCKEN, XXTALIN, XXTALCPUEN, XHVREGON, XFDRESET, XGLDATA5, XGLDATA4, XGLDATA3, XGLDATA2, XGLDATA1, XGLDATA0):H, (XSTRMD3, XSTRMD2, XSTRMD1, XSTRMD0, XNSS3, XNSS2, XNSS1, XNSS0):H, XTECLOCK, XRXDATA, XRXENABLE, XTXDATA, XTXENABLE, XNRESET, XTCK, XTMS, XTDI, XTDO, XNTRST)
{
repeat 2 > flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 0 0 X 0; // XNTRST
repeat 9 > flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 1 0 X 1; // Test Logic Reset
> flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 0 0 X 1; // Run Test Idle
repeat 2 > flash_writ X0X00X0XXXXXXXXX 0000XXXX X 0 L X X 0 1 1 0 X 1; // Select IR
or if you prefer brevity over clarity:
awk '/^repeat/{p=$0;next} /^ >/{$0=sprintf("%-21s",p)$0;p=""} 1' file
and if you want "in place" editing then use GNU awk:
awk -i inplace '/^repeat/{p=$0;next} /^ >/{$0=sprintf("%-21s",p)$0;p=""} 1' file
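For comparison, the same buffer-and-pad idea can be sketched in Python. This is only a sketch and not part of either answer: the function name join_repeat_lines and its width parameter are hypothetical, and it assumes that vector lines begin with > after optional leading whitespace.

```python
def join_repeat_lines(lines, width=21):
    """Merge each 'repeat N' line with the vector line that follows it."""
    out = []
    prefix = ""  # pending 'repeat N' text; empty when nothing is buffered
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("repeat"):
            prefix = stripped  # hold the line back, like pfx in the awk script
            continue
        if stripped.startswith(">"):
            # Left-justify the prefix so every '>' starts in the same column.
            line = f"{prefix:<{width}}{stripped}"
            prefix = ""
        out.append(line)
    return out


sample = ["{", "repeat 2", "    > flash_writ X0 0; // XNTRST", "    > flash_writ X1 1; // Idle"]
for row in join_repeat_lines(sample, width=10):
    print(row)
```

As in the awk version, the width has to be at least as long as the longest repeat line, otherwise the columns stop lining up.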

Cumsum entire table and reset at zero

I have following data frame.
d = pd.DataFrame({'one' : [0,1,1,1,0,1],'two' : [0,0,1,0,1,1]})
d
one two
0 0 0
1 1 0
2 1 1
3 1 0
4 0 1
5 1 1
I want a cumulative sum that resets at zero.
The desired output is:
pd.DataFrame({'one' : [0,1,2,3,0,1],'two' : [0,0,1,0,1,2]})
one two
0 0 0
1 1 0
2 2 1
3 3 0
4 0 1
5 1 2
I have tried using groupby, but it does not work for the entire table.

df2 = d.apply(lambda x: x.groupby((~x.astype(bool)).cumsum()).cumsum())
print(df2)
Output:
one two
0 0 0
1 1 0
2 2 1
3 3 0
4 0 1
5 1 2
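The grouping key in the one-liner above, (~x.astype(bool)).cumsum(), increases by one at every zero, so each zero opens a new group and the inner cumsum restarts there. A minimal plain-Python sketch of that idea for a single column follows; the function name cumsum_reset is hypothetical and no pandas is involved.

```python
from itertools import accumulate


def cumsum_reset(values):
    """Running sum of one column that restarts whenever a zero is met."""
    # The group id grows by one at every zero, like (~x.astype(bool)).cumsum().
    group_ids = list(accumulate(int(v == 0) for v in values))
    out, running, current = [], 0, None
    for gid, v in zip(group_ids, values):
        if gid != current:  # a new group starts: restart the running total
            running, current = 0, gid
        running += v
        out.append(running)
    return out


print(cumsum_reset([0, 1, 1, 1, 0, 1]))
print(cumsum_reset([0, 0, 1, 0, 1, 1]))
```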

pandas
def cum_reset_pd(df):
    csum = df.cumsum()
    # Subtract the running total as it stood at the most recent zero in each column.
    return (csum - csum.where(df == 0).ffill()).astype(df.dtypes)
cum_reset_pd(d)
one two
0 0 0
1 1 0
2 2 1
3 3 0
4 0 1
5 1 2
numpy
def cum_reset_np(df):
    v = df.values
    z = np.zeros_like(v)
    j, i = np.where(v.T)
    r = np.arange(1, i.size + 1)
    p = np.where(
        np.append(False, (np.diff(i) != 1) | (np.diff(j) != 0))
    )[0]
    b = np.append(0, np.append(p, r.size))
    z[i, j] = r - b[:-1].repeat(np.diff(b))
    return pd.DataFrame(z, df.index, df.columns)
cum_reset_np(d)
one two
0 0 0
1 1 0
2 2 1
3 3 0
4 0 1
5 1 2
Why go through this trouble?
Because it's quicker!

This one is without using Pandas, but using NumPy and list comprehensions:
import numpy as np
d = {'one': [0,1,1,1,0,1], 'two': [0,0,1,0,1,1]}
out = {}
for key in d.keys():
    l = d[key]
    indices = np.argwhere(np.array(l) == 0).flatten()
    indices = np.append(indices, len(l))
    out[key] = np.concatenate([np.cumsum(l[indices[n-1]:indices[n]])
                               for n in range(1, indices.shape[0])]).ravel()
print(out)
First, I find all occurrences of 0 (the positions at which to split the lists), then I calculate the cumsum of each resulting sublist and insert the results into a new dict.

This should do it:
import pandas as pd
d = {'one' : [0,1,1,1,0,1],'two' : [0,0,1,0,1,1]}
one = d['one']
two = d['two']
i = 0
new_one = []
for item in one:
    if item == 0:
        i = 0
    else:
        i += item
    new_one.append(i)
j = 0
new_two = []
for item in two:
    if item == 0:
        j = 0
    else:
        j += item
    new_two.append(j)
d['one'], d['two'] = new_one, new_two
df = pd.DataFrame(d)
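The two loops above differ only in the column they read, so the same logic can be applied to every column at once. A minimal sketch using only the standard library, assuming the table is a dict of lists as in this answer; the names reset_cumsum_table and step are hypothetical.

```python
from itertools import accumulate


def reset_cumsum_table(table):
    """Apply a zero-resetting running sum to every column of a dict of lists."""
    def step(total, value):
        # A zero restarts the count; anything else adds to the running total.
        return 0 if value == 0 else total + value

    return {name: list(accumulate(col, step)) for name, col in table.items()}


d = {'one': [0, 1, 1, 1, 0, 1], 'two': [0, 0, 1, 0, 1, 1]}
print(reset_cumsum_table(d))
```

The resulting dict can be handed to pd.DataFrame in the same way as in the answer above.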

Find position of first non-zero decimal

Suppose I have the following local macro:
loc a = 12.000923
I would like to get the decimal position of the first non-zero decimal (4 in this example).
There are many ways to achieve this. One is to treat a as a string and to find the position of .:
loc a = 12.000923
loc b = strpos(string(`a'), ".")
di "`b'"
From here one could loop through the decimals and count until the first non-zero element is reached. Of course, this doesn't seem to be a very elegant approach.
Can you suggest a better way to deal with this? Regular expressions perhaps?

Well, I don't know Stata, but according to the documentation, \.(0+)? is supported, and it shouldn't be hard to convert this two-line JavaScript function to Stata.
It returns the position of the first nonzero decimal, or -1 if there is no decimal point.
function getNonZeroDecimalPosition(v) {
var v2 = v.replace(/\.(0+)?/, "")
return v2.length !== v.length ? v.length - v2.length : -1
}
Explanation
We remove from input string a dot followed by optional consecutive zeros.
The difference between the lengths of original input string and this new string gives the position of the first nonzero decimal
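The length-difference idea described above can also be sketched in Python. This is an illustrative port, not the answer's own code: the function name first_nonzero_decimal_position is hypothetical, and like the JavaScript version it assumes that a non-zero digit actually follows the zeros.

```python
import re


def first_nonzero_decimal_position(text):
    """Position of the first non-zero decimal, or -1 if there is no dot."""
    # Drop the first dot together with any zeros that directly follow it.
    stripped = re.sub(r"\.0*", "", text, count=1)
    removed = len(text) - len(stripped)
    # The removed run is the dot plus the leading zeros, which equals the position.
    return removed if removed else -1


for sample in ["12.000923", "12", "12.012", "-10.05012"]:
    print(sample, first_nonzero_decimal_position(sample))
```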
Demo
Sample Snippet
function getNonZeroDecimalPosition(v) {
var v2 = v.replace(/\.(0+)?/, "")
return v2.length !== v.length ? v.length - v2.length : -1
}
var samples = [
"loc a = 12.00012",
"loc b = 12",
"loc c = 12.012",
"loc d = 1.000012",
"loc e = -10.00012",
"loc f = -10.05012",
"loc g = 0.0012"
]
samples.forEach(function(sample) {
console.log(getNonZeroDecimalPosition(sample))
})

You can do this in mata in one line and without using regular expressions:
foreach x in 124.000923 65.020923 1.000022030 0.0090843 .00000425 {
mata: selectindex(tokens(tokens(st_local("x"), ".")[selectindex(tokens(st_local("x"), ".") :== ".") + 1], "0") :!= "0")[1]
}
4
2
5
3
6
Below, you can see the steps in detail:
. local x = 124.000823
. mata:
: /* Step 1: break Stata's local macro x in tokens using . as a parsing char */
: a = tokens(st_local("x"), ".")
: a
1 2 3
+----------------------------+
1 | 124 . 000823 |
+----------------------------+
: /* Step 2: tokenize the string in a[1,3] using 0 as a parsing char */
: b = tokens(a[3], "0")
: b
1 2 3 4
+-------------------------+
1 | 0 0 0 823 |
+-------------------------+
: /* Step 3: find which values are different from zero */
: c = b :!= "0"
: c
1 2 3 4
+-----------------+
1 | 0 0 0 1 |
+-----------------+
: /* Step 4: find the first index position where this is true */
: d = selectindex(c :!= 0)[1]
: d
4
: end
You can also find the position of the string of interest in Step 2 using the same logic.
This is the index value after the one for .:
. mata:
: k = selectindex(a :== ".") + 1
: k
3
: end
In which case, Step 2 becomes:
. mata:
:
: b = tokens(a[k], "0")
: b
1 2 3 4
+-------------------------+
1 | 0 0 0 823 |
+-------------------------+
: end
For unexpected cases without decimal:
foreach x in 124.000923 65.020923 1.000022030 12 0.0090843 .00000425 {
if strmatch("`x'", "*.*") mata: selectindex(tokens(tokens(st_local("x"), ".")[selectindex(tokens(st_local("x"), ".") :== ".") + 1], "0") :!= "0")[1]
else display " 0"
}
4
2
5
0
3
6
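For readers more comfortable outside Mata, the same split-then-scan steps can be sketched in Python. This is a rough analogue rather than a translation: the function name first_nonzero_decimal is hypothetical, the input is assumed to be the number as text, and an all-zero fraction is not treated specially.

```python
def first_nonzero_decimal(number_text):
    """Position of the first non-zero decimal; 0 when there is no decimal point."""
    # Step 1: split on the dot, like tokens(st_local("x"), ".").
    whole, dot, fraction = number_text.partition(".")
    if not dot:
        return 0
    # Steps 2-4: count the zeros that lead the fraction, then move one past them.
    leading_zeros = len(fraction) - len(fraction.lstrip("0"))
    return leading_zeros + 1


for x in ["124.000923", "65.020923", "1.000022030", "12", "0.0090843", ".00000425"]:
    print(first_nonzero_decimal(x))
```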

A straightforward answer uses regular expressions and string functions.
One can select all decimals, find the first non 0 decimal, and finally find its position:
loc v = "123.000923"
loc v2 = regexr("`v'", "^[0-9]*[/.]", "") // 000923
loc v3 = regexr("`v'", "^[0-9]*[/.][0]*", "") // 923
loc first = substr("`v3'", 1, 1) // 9
loc first_pos = strpos("`v2'", "`first'") // 4: position of 9 in 000923
di "`v2'"
di "`v3'"
di "`first'"
di "`first_pos'"
Which in one step is equivalent to:
loc first_pos2 = strpos(regexr("`v'", "^[0-9]*[/.]", ""), substr(regexr("`v'", "^[0-9]*[/.][0]*", ""), 1, 1))
di "`first_pos2'"
An alternative suggested in another answer is to compare the length of the decimals block with its leading 0s stripped against the length of the unstripped block.
In one step this is:
loc first_pos3 = strlen(regexr("`v'", "^[0-9]*[/.]", "")) - strlen(regexr("`v'", "^[0-9]*[/.][0]*", "")) + 1
di "`first_pos3'"

Instead of a regex, this uses log10, which treats the number as a number. The function:
For numbers >= 1 or <= -1, returns a positive number: the count of digits to the left of the decimal point.
For numbers strictly between -1 and 1 (closer to what you were asking), returns zero or a negative number whose magnitude is the count of zeros between the decimal point and the first non-zero digit.
const digitsFromDecimal = (n) => {
  // "| 0" truncates toward zero and also turns log10(0) = -Infinity into 0
  let dFD = Math.log10(Math.abs(n)) | 0;
  if (n >= 1 || n <= -1) { dFD++; }
  return dFD;
};
var x = [118.8161330, 11.10501660, 9.254180571, -1.245501523, 1, 0, 0.864931613, 0.097007836, -0.010880074, 0.009066729];
x.forEach(element => {
console.log(`${element}, Digits from Decimal: ${digitsFromDecimal(element)}`);
});
// Output
// 118.816133, Digits from Decimal: 3
// 11.1050166, Digits from Decimal: 2
// 9.254180571, Digits from Decimal: 1
// -1.245501523, Digits from Decimal: 1
// 1, Digits from Decimal: 1
// 0, Digits from Decimal: 0
// 0.864931613, Digits from Decimal: 0
// 0.097007836, Digits from Decimal: -1
// -0.010880074, Digits from Decimal: -1
// 0.009066729, Digits from Decimal: -2
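A Python sketch of the same log10 approach, for comparison. The name digits_from_decimal is hypothetical, and zero is handled explicitly here because Python's math.log10(0) raises an error instead of returning negative infinity.

```python
import math


def digits_from_decimal(n):
    """Digits left of the point (positive) or leading decimal zeros (negative)."""
    if n == 0:
        return 0  # math.log10(0) would raise ValueError
    # Truncate toward zero, mirroring the "| 0" in the JavaScript version.
    d = math.trunc(math.log10(abs(n)))
    if abs(n) >= 1:
        d += 1
    return d


for value in [118.816133, -1.245501523, 1, 0, 0.864931613, 0.097007836, 0.009066729]:
    print(value, digits_from_decimal(value))
```

To get the position of the first non-zero decimal for a value strictly between -1 and 1, take the magnitude of the result and add one.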

Pearly's Mata solution is very likable, but the "unexpected" case of a number with no decimal part at all deserves attention.
Besides, a regular expression is not a bad choice when it fits in a memorable one-liner.
loc v = "123.000923"
capture local x = regexm("`v'","(\.0*)")*length(regexs(0))
The code below tests more values of v.
foreach v in 124.000923 605.20923 1.10022030 0.0090843 .00000425 12 .000125 {
capture local x = regexm("`v'","(\.0*)")*length(regexs(0))
di "`v': The wanted number = `x'"
}

field_delim="\t" doesn't work properly in tf.decode_csv(csv_row, record_defaults=listoflists,field_delim="\t") in tensorflow

I have a tab-separated CSV file.
I use the following code fragment:
data = tf.decode_csv(csv_row, record_defaults=listoflists,field_delim="\t")
but it raises the following error:
tensorflow.python.framework.errors.InvalidArgumentError: Expect 5 fields but have 1 in record 0
But when I convert the file to comma-separated or space-separated, it works correctly:
1. Comma Separated
data = tf.decode_csv(csv_row, record_defaults=listoflists)
2. Space Separated
data = tf.decode_csv(csv_row, record_defaults=listoflists,field_delim=" ")
The full Code
from __future__ import print_function
import tensorflow as tf
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
filename = "test.csv"
# setup text reader
file_length = file_len(filename)
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader(skip_header_lines=1)
_, csv_row = reader.read(filename_queue)
# setup CSV decoding
listoflists = []
for i in range(0,5):
    listoflists.append((list([0])))
data = tf.decode_csv(csv_row, record_defaults=listoflists,field_delim="\t")
# turn features back into a tensor
print("loading, " + str(file_length) + " line(s)\n")
with tf.Session() as sess:
    tf.initialize_all_variables().run()
    # start populating filename queue
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(file_length):
        # retrieve a single instance
        example = sess.run(data)
        print(example)
    coord.request_stop()
    coord.join(threads)
    print("\ndone loading")
Sample Data
Tab Separated :
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
Comma Separated :
1,0,1,1,1
1,0,1,1,1
1,0,1,1,1
1,0,1,1,1
1,0,1,1,1
1,0,1,1,1
Space Separated :
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
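The error reports one field where five were expected, which is what a splitter sees when the delimiter never occurs in the row. One thing worth checking (an assumption about the cause, not something the question confirms) is whether the file really contains tab characters, since some editors save tabs as spaces. A minimal standard-library sketch for that check follows; the helper name field_counts is hypothetical.

```python
def field_counts(lines, delimiter="\t"):
    """Return how many fields each line splits into for the given delimiter."""
    return [len(line.rstrip("\n").split(delimiter)) for line in lines]


# One row with real tabs and one with spaces, for illustration.
rows = ["1\t0\t0\t0\t0\n", "1 0 0 0 0\n"]
print(field_counts(rows))       # fields per row when splitting on tabs
print(field_counts(rows, " "))  # fields per row when splitting on spaces
```

Running it on the lines of test.csv (for example, field_counts(open("test.csv"))) would show whether every row splits into five fields on "\t".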

How to count the values of columns row wise based on conditions in Pandas

I have a pandas dataframe df below
df = pd.DataFrame({'id':[1,2,3],'v' : ['r','r','i'], 'w' : ['r','r','i'],'x' : ['r','i','i']})
df
id v w x
1 r r r
2 r r i
3 i i i
The column values are r and i. I want to count the occurrences of r and i row-wise and add two more columns, r and i, holding those counts for each row. The final result I am expecting is given below:
id v w x r i
1 r r r 3 0
2 r r i 2 1
3 i i i 0 3

Method 1
In [15]:
def count(df):
    df['i'] = np.sum(df == 'i')
    df['r'] = np.sum(df == 'r')
    return df
In [16]:
df.apply(count, axis = 1)
Out[16]:
id v w x i r
0 1 r r r 0 3
1 2 r r i 1 2
2 3 i i i 3 0
Method 2
In [9]:
count = df.apply(lambda x : x.value_counts() , axis = 1)[['i' , 'r']]
count
Out[9]:
i r
0 NaN 3
1 1 2
2 3 NaN
In [10]:
pd.concat([df , count.fillna(0)] , axis = 1)
Out[10]:
id v w x i r
0 1 r r r 0 3
1 2 r r i 1 2
2 3 i i i 3 0
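The row-wise counting itself does not depend on pandas. A minimal sketch with collections.Counter follows, assuming each row is a plain dict; the function name add_value_counts and its labels and skip parameters are hypothetical.

```python
from collections import Counter


def add_value_counts(rows, labels=("r", "i"), skip=("id",)):
    """Append one count column per label to every row."""
    result = []
    for row in rows:
        # Count the values of all columns except the ones listed in skip.
        counts = Counter(v for k, v in row.items() if k not in skip)
        # A Counter returns 0 for labels that never occur, so no fillna step is needed.
        result.append({**row, **{label: counts[label] for label in labels}})
    return result


rows = [
    {"id": 1, "v": "r", "w": "r", "x": "r"},
    {"id": 2, "v": "r", "w": "r", "x": "i"},
    {"id": 3, "v": "i", "w": "i", "x": "i"},
]
for row in add_value_counts(rows):
    print(row)
```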

Printing element of List in a different way

I need to print a List of Lists using Scala and the function toString, where every occurrence of 0 needs to be replaced by an '_'. This is my attempt so far. The commented code represents my different attempts.
override def toString() = {
// grid.map(i => if(i == 0) '_' else i)
// grid map{case 0 => '_' case a => a}
// grid.updated(0, "_")
//grid.map{ case 0 => "_"; case x => x}
grid.map(_.mkString(" ")).mkString("\n")
}
My output should look something like this, but with underscores instead of the zeros:
0 0 5 0 0 6 3 0 0
0 0 0 0 0 0 4 0 0
9 8 0 7 4 0 0 0 5
1 0 0 0 7 0 9 0 0
0 0 9 5 0 1 6 0 0
0 0 8 0 2 0 0 0 7
6 0 0 0 1 8 0 9 3
0 0 1 0 0 0 0 0 0
Thanks in advance.

Just put an extra map in there to change 0 to _
grid.map(_.map(_ match {case 0 => "_"; case x => x}).mkString(" ")).mkString("\n")

Nothing special:
def toString(xs: List[List[Int]]) = xs.map { ys =>
  ys.map {
    case 0 => "_"
    case x => String.valueOf(x)
  }.mkString(" ")
}.mkString("\n")

Although the other solutions are functionally correct, I believe this shows more explicitly what happens and as such is better suited for a beginner:
def gridToString(grid: List[List[Int]]): String = {
  def replaceZero(i: Int): Char =
    if (i == 0) '_'
    else i.toString charAt 0
  val lines = grid map { line =>
    line map replaceZero mkString " "
  }
  lines mkString "\n"
}
First we define a method for converting a digit into a character, replacing zeroes with underscores. (It is assumed from your example that all the Int elements are between 0 and 9.)
Then we take each line of the grid, run each of the digits in that line through our conversion method and assemble the resulting chars into a string.
Then we take the resulting line-strings and turn them into the final string.
The whole thing could be written shorter, but it wouldn't necessarily be more readable.
It is also good Scala style to use small inner methods like replaceZero in this example instead of writing all code inline, as the name of a method helps indicate what it does, and as such enhances readability.

There's always room for another solution. ;-)
A grid:
type Grid[T] = List[List[T]]
Print a grid:
def print[T](grid: Grid[T]) = grid map(_ mkString " ") mkString "\n"
Replace all zeroes:
for (row <- grid) yield row.collect {
case 0 => "_"
case anything => anything
}
