Using RegEx to insert data by the nth term

I would like to insert a new term after the nth term, with that nth term substituted into it. For instance, how would I change the following in Notepad++:
x1 y1 z1 a1 b1 c1
x2 y2 z2 a2 b2 c2
x3 y3 z3 a3 b3 c3
to
x1 y1 z1 ["z1"] a1 b1 c1
x2 y2 z2 ["z2"] a2 b2 c2
x3 y3 z3 ["z3"] a3 b3 c3
where x, y, z, a, b and c are strings separated by spaces.
Another example:
apples bananas pears grapes oranges lemons
to
apples bananas pears grapes fruit:(grapes) oranges lemons
and so on.

Suppose you have one group that matches your elements, for example [1-9], and another group that matches the separator between your elements, for example [,.]. Then you can write the following:
(([1-9][,.]){n})
This will match the first n elements, each with the separator after it.
You can then reference the captured text in your replacement and append your insertion after it.
Is that something you're looking for?
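For instance, to sanity-check the idea outside Notepad++, here is a minimal Python sketch (the element class [1-9], the separator class [,.], the sample string and n=2 are all just illustrative assumptions, not anything from the question):
import re

s = "1,2,3,4,5"
n = 2  # insert after the 2nd element (hypothetical choice)
# Capture the first n "element + separator" pairs, then re-emit them
# followed by the inserted text.
result = re.sub(rf'^(([1-9][,.]){{{n}}})', r'\1NEW,', s)
print(result)  # 1,2,NEW,3,4,5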

In the find put:
(\w+\s\w+\s)(\w+)
and in the replace put:
\1\2 ["\2"]
See this question for more info: NotePad++ replace problem.
I guess this would also make sense for the find:
(x[ ]y[ ]z[ ])
For your second example, where the insertion goes after the 4th word, it would be
find:
(\w+\s\w+\s\w+\s)(\w+)
replace:
\1\2 fruit:(\2)
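To verify the pattern outside Notepad++, here is a minimal Python sketch of the same substitution (Python's re syntax agrees with Notepad++'s for a pattern this simple):
import re

line = "apples bananas pears grapes oranges lemons"
# Group 1: the first three words with their trailing spaces;
# group 2: the fourth word, which gets repeated inside fruit:(...).
result = re.sub(r'^(\w+\s\w+\s\w+\s)(\w+)', r'\1\2 fruit:(\2)', line)
print(result)  # apples bananas pears grapes fruit:(grapes) oranges lemons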


Replacing string by incrementing number

Input File
AAAAAA this is some content.
This is AAAAAA some more content BBBBBB. BBBBBB BBBBBB
This is yet AAAAAA some more BBBBBB BBBBBB BBBBBB content.
I can accomplish this partially with this code:
awk '/AAAAAA/{gsub("AAAAAA", "x"++i)}1' test.txt > test1.txt
awk '{for(x=1;x<=NF;x++)if($x~/BBBBBB/){sub(/BBBBBB/,"y"++i)}}1' test1.txt
Output:
x1 this is some content.
This is x2 some more content y1. y2 y3
This is yet x3 some more y4 y5 y6 content.
Any way to get this output?
Expected Output:
x1 this is some content.
This is x2 some more content y1. y2 y3
This is yet x3 some more y1 y2 y3 content.
Another one:
$ awk '{sub("AAAAAA","x"(++x)); y=0; while(sub("BBBBBB","y"(++y)));}1' file
x1 this is some content.
This is x2 some more content y1. y2 y3
This is yet x3 some more y1 y2 y3 content.
You may use this single awk:
awk '{
  j = 0
  for (x = 1; x <= NF; x++)
    if ($x ~ /^A{6}/)
      sub(/^A{6}/, "x" (++i), $x)
    else if ($x ~ /^B{6}/)
      sub(/^B{6}/, "y" (++j), $x)
} 1' file
x1 this is some content.
This is x2 some more content y1. y2 y3
This is yet x3 some more y1 y2 y3 content.
You just need to reset i=0 after each line:
awk '/AAAAAA/{gsub("AAAAAA", "x"++i)}1' test.txt > test1.txt
awk '{for(x=1;x<=NF;x++)if($x~/BBBBBB/)sub(/BBBBBB/,"y"++i); i=0}1' test1.txt
Output:
x1 this is some content.
This is x2 some more content y1. y2 y3
This is yet x3 some more y1 y2 y3 content.
Here is an alternate awk that is easily extended to add as many tags as you wish:
awk 'BEGIN{ rep["AAAAAA"]="x"; cnts["AAAAAA"]=1; reset["AAAAAA"]=0
            rep["BBBBBB"]="y"; cnts["BBBBBB"]=1; reset["BBBBBB"]=1
            # and so on...
}
{
  for (e in rep) {
    cnts[e] = (reset[e]) ? reset[e] : cnts[e]
    while ( sub(e, rep[e] (cnts[e]++)) )
      ; # empty statement since work is done inside the while
  }
} 1' file
Prints:
x1 this is some content.
This is x2 some more content y1. y2 y3
This is yet x3 some more y1 y2 y3 content.

Reading csv with several subgroups

I have a csv-file that contains "pivot-like" data that I would like to store in a pandas DataFrame. The original data file is indented with different numbers of leading spaces to differentiate between the levels in the pivot data, like so:
Text that I do not want to include,,
,Text that I do not want to include,Text that I do not want to include
,header A,header B
Total,100,100
A,,2.15
   a1,,2.15
B,,0.22
   b1,,0.22
   " slightly longer name"...,,0.22
   b3,,0.22
C,71.08,91.01
   c1,57.34,73.31
      c2,5.34,6.76
         c3,1.33,1.67
            x1,0.26,0.33
            x2,0.26,0.34
            x3,0.48,0.58
            x4,0.33,0.42
         c4,3.52,4.33
            x5,0.27,0.35
            x6,0.21,0.27
            x7,0.49,0.56
            x8,0.44,0.47
            x9,0.15,0.19
            x10,,0.11
            x11,0.18,0.23
            x12,0.18,0.23
            x13,0.67,0.85
            x14,0.24,0.2
            x15,0.68,0.87
         c5,0.48,0.76
            x16,,0.15
            x17,0.3,0.38
            x18,0.18,0.23
      d2,6.75,8.68
         d3,0.81,1.06
            x19,0.3,0.38
            x20,0.51,0.68
Others,24.23,0
N/A,,
"Text that I do not want to include(""at all"") ",,
(It looks awful, but you should be able to paste it into e.g. Notepad to see it a bit clearer.)
Basically, there are only two columns a and b, but the rows are indented using 0, 3, 6, 9, ... spaces to differentiate between the levels. So for instance,
zero level, the main group, A has 0 spaces,
first level a1 has 3 spaces,
second level a2 has 6 spaces,
third level a3 has 9 spaces and
fourth and final level has 12 spaces with the corresponding values for columns a and b respectively.
I would now like to be able to read and group this data on these levels in order to create a new summarizing DataFrame, with columns corresponding to these different levels, looking like:
Level 4 Diff(a,b) Level 0 Level 1 Level 2 Level 3
x7 525 C c1 c2 c3
x5 -0.03 A a1 a22 NaN
x4 -0.04 A a1 a22 NaN
x8 -0.08 C c1 c2 c3
…
Any clue on how to do this?
Thanks
Easiest is to split this into different functions:
read the file
parse the lines
generate the 'tree'
construct the DataFrame
Parse the lines
def parse_file(file):
    import ast
    import re
    pat = re.compile(r'^( *)(\w+),([\d.]+),([\d.]+)$')
    for line in file:
        r = pat.match(line)
        if r:
            spaces, label, a, b = r.groups()
            diff = ast.literal_eval(a) - ast.literal_eval(b)
            yield len(spaces)//3, label, diff
Reads each line and yields the level, label and diff using a regular expression. I use ast to convert the strings to int or float.
Generate the tree
def parse_lines(lines):
    previous_label = list(range(5))
    for level, label, diff in lines:
        previous_label[level] = label
        if level == 4:
            yield tuple(previous_label), diff
Initiates a list of length 5, then overwrites the entry for the level each node is on; whenever a level-4 node appears, the current path is yielded.
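For example, feeding it a few hand-written (level, label, diff) tuples (a toy input, not parsed from the real file) shows how the most recent label at each level is carried along:
# Toy input: one full path A > a1 > a2 > a3 with two level-4 leaves.
lines = [(0, 'A', 0.0), (1, 'a1', 0.0), (2, 'a2', 0.0),
         (3, 'a3', 0.0), (4, 'x1', -0.07), (4, 'x2', -0.08)]
print(list(parse_lines(lines)))
# [(('A', 'a1', 'a2', 'a3', 'x1'), -0.07),
#  (('A', 'a1', 'a2', 'a3', 'x2'), -0.08)]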
Construct the DataFrame
from io import StringIO
import pandas as pd

# file_content holds the raw csv text shown in the question
with StringIO(file_content) as file:
    lines = parse_file(file)
    index, data = zip(*parse_lines(lines))
    idx = pd.MultiIndex.from_tuples(index, names=[f'level_{i}' for i in range(len(index[0]))])
    df = pd.DataFrame(data={'Diff(a,b)': list(data)}, index=idx)
Opens the file, constructs the index and generates the DataFrame with the different levels in the index. If you don't want this, you can add a .reset_index() or construct the DataFrame slightly differently.
df
level_0 level_1 level_2 level_3 level_4 Diff(a,b)
A a1 a2 a3 x1 -0.07
A a1 a2 a3 x2 -0.08000000000000002
A a1 a22 a3 x3 -0.04999999999999999
A a1 a22 a3 x4 -0.04000000000000001
A a1 a22 a3 x5 -0.03
A a1 a22 a3 x6 -0.06999999999999998
C c1 c2 c3 x7 525.0
C c1 c2 c3 x8 -0.08000000000000002
Alternative for missing levels
def parse_lines(lines):
    labels = [None] * 5
    previous_level = None
    for level, label, diff in lines:
        labels[level] = label
        if level == 4:
            if previous_level is not None and previous_level < 3:
                # a level was skipped: blank out the entries between the
                # previous node's level and this leaf instead of reusing them
                labels = labels[:previous_level + 1] + [None] * (4 - previous_level)
                labels[level] = label
            yield tuple(labels), diff
        previous_level = level
The items under a22 don't seem to have a level_3, so the first version copies it over from the previous node. If that is unwanted, you can use this variation.
df
level_0 level_1 level_2 level_3 level_4 Diff(a,b)
C c1 c2 c3 x1 -0.07
C c1 c2 c3 x2 -0.08000000000000002
C c1 c2 c3 x3 -0.09999999999999998
C c1 c2 c3 x4 -0.08999999999999997
C c1 c2 c4 x5 -0.07999999999999996
C c1 c2 c4 x6 -0.060000000000000026
C c1 c2 c4 x7 -0.07000000000000006
C c1 c2 c4 x8 -0.02999999999999997
C c1 c2 c4 x9 -0.04000000000000001
C c1 c2 c4 x11 -0.05000000000000002
C c1 c2 c4 x12 -0.05000000000000002
C c1 c2 c4 x13 -0.17999999999999994
C c1 c2 c4 x14 0.03999999999999998
C c1 c2 c4 x15 -0.18999999999999995
C c1 c2 c5 x17 -0.08000000000000002
C c1 c2 c5 x18 -0.05000000000000002
C c1 d2 d3 x19 -0.08000000000000002
C c1 d2 d3 x20 -0.17000000000000004

Exclude a few columns from a grouped selection by `dplyr::contains`

Suppose a data frame with several groups of columns (linked by their names, here Bla and D):
df = data.frame(A=1, BlaTata=2, BlaTato=3, BlaTota=4, BlaToto=5,
C=6, D1=7, D2=8, D3=9, D4=10)
# A BlaTata BlaTato BlaTota BlaToto C D1 D2 D3 D4
# 1 2 3 4 5 6 7 8 9 10
How can I easily drop all columns containing Bla (i.e., select(-contains('Bla'))) except for a few of them that I would explicitly "protect" from the (de)selection procedure?
Supposing I want to "protect" BlaTato and BlaToto:
df %>% mutate(saveBlaToto=BlaToto, saveBlaTato=BlaTato) %>%
select(-starts_with('Bla')) %>%
mutate(BlaToto=saveBlaToto, BlaTato=saveBlaTato) %>%
select(-contains('save')) %>%
select(order(colnames(.)))
# A BlaTato BlaToto C D1 D2 D3 D4
# 1 3 5 6 7 8 9 10
There must be an easier and more elegant way ;-)
Supposing it is not handy to select by column index etc.
Something like select(-contains('Bla' but keep c('BlaTato','BlaToto'))) possibly for several columns to be preserved...
EDIT
This question is answered in Frank's "New Question" below.
The original question, simpler and answered in his "First Question", was "How to drop all columns containing B except from B2 in the following data frame":
df = data.frame(A=1, B1=2, B2=3, B3=4, B4=5, C=6, D1=7, D2=8, D3=9, D4=10)
First question. If you look at ?select, you'll see that you can enter a regular expression, like
# example
df = data.frame(A=1, B1=2, B2=3, B3=4, B4=5, C=6, D1=7, D2=8, D3=9, D4=10)
# goal: drop B, protect B2
df %>% select(-matches('^B[^2]$'))
A B2 C D1 D2 D3 D4
1 1 3 6 7 8 9 10
Reading the regex:
^ and $ indicate start and end of the string.
[^x] means any character except x.
New question. It looks like dplyr doesn't support Perl-style regexes yet, so...
# example
df = data.frame(A=1, BlaTata=2, BlaTato=3, BlaTota=4, BlaToto=5,
C=6, D1=7, D2=8, D3=9, D4=10)
# goal: drop Bla, protect BlaTato, BlaToto
df %>% select(-grep('^Bla(?!Tato|Toto)', names(.), perl=TRUE))
A BlaTato BlaToto C D1 D2 D3 D4
1 1 3 5 6 7 8 9 10
Reading the regex:
(?!xyz) means "don't be followed by xyz"
x|y means x or y
For more info on regular expressions and the base R functions for using them, read ?regex and ?grep. Really, though, you shouldn't name your columns like this. If you find yourself in a position where you need to parse column names, you probably made a mistake earlier on.

Notepad++ regex Insert string at beginning of line between two values

How can I insert at the start of a line a string dependent on two other string values that define beginning and end locations?
So for example, I have
First
x
y
z
Second
a
b
c
Third
d
e
f
The result I would like to achieve is;
First
Q1 x
Q1 y
Q1 z
Second
Q2 a
Q2 b
Q2 c
Third
Q3 d
Q3 e
Q3 f
For the final section there is no string to define the end but just the end of the document.
Thanks!
This task cannot be done using pure regex (this can be proven mathematically); you would have to use something like the Notepad++ Python plugin.
This post has an example: https://superuser.com/questions/376288/how-do-i-add-input-to-my-macro-to-replace-text-in-notepad. For your task, you just need to make the inserted text depend on the most recently read section marker.
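For illustration, here is a minimal standalone Python sketch of that approach (the header names First/Second/Third and the file name input.txt are assumptions taken from the example, not a general solution):
# Prefix every non-header line with Q<n>, where n counts how many
# section headers we have passed so far.
headers = {'First', 'Second', 'Third'}  # assumed section markers

section = 0
with open('input.txt') as f:  # hypothetical input file
    for raw in f:
        line = raw.rstrip('\n')
        if line in headers:
            section += 1
            print(line)
        elif line:
            print(f'Q{section} {line}')
        else:
            print(line)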

Splunk query to compare two fields and select the value from a 3rd field if the comparison matches

I am very new to Splunk and need your help resolving the issue below.
I have two CSV files uploaded to a Splunk instance. Each file and its fields are listed below.
Apple.csv
a. A1
b. A2
c. A3
Orange.csv
a. O1 (may have values matching the values of A3)
b. O2
My requirement is as below:
Select the values of A1, A2, A3 and O2 from Apple.csv and Orange.csv
where A1="X" and A2="Y" and A3 = O1,
and display them in a table.
Apple.csv:
A1 A2 A3
X Y 123
LP HJK 222
X Y 999
Orange.csv:
O1 O2
999 open
123 closed
65432 open
Output:
A1 A2 A3 O2
X Y 123 closed
X Y 999 open
Very much appreciate your help.
You could do this
source="apple.csv" OR source="orange.csv"
| eval grouping=coalesce(A3,O1)
| stats first(A1) as A1 first(A2) as A2 first(A3) as A3 first(O2) as O2 by grouping
| fields - grouping
Although I would think that considering the timestamp of the events might also be important...
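The trick is that events from both files land in the same search, coalesce(A3, O1) gives every event a shared join key, and stats first(...) collapses each key into one row. Purely as an illustration of that grouping idea (this is pandas, not Splunk, and the sample values are copied from the question):
import pandas as pd

apple = pd.DataFrame({'A1': ['X', 'LP', 'X'],
                      'A2': ['Y', 'HJK', 'Y'],
                      'A3': [123, 222, 999]})
orange = pd.DataFrame({'O1': [999, 123, 65432],
                       'O2': ['open', 'closed', 'open']})

# Like coalesce(A3, O1): stack both sources and derive a common key.
events = pd.concat([apple, orange], ignore_index=True)
events['grouping'] = events['A3'].fillna(events['O1'])

# Like stats first(...) by grouping: first non-null value per field.
result = events.groupby('grouping', as_index=False).first()

# Keep only keys that appeared in both files.
result = result[result['A1'].notna() & result['O2'].notna()]
print(result[['A1', 'A2', 'A3', 'O2']])
#   A1 A2     A3      O2
# 0  X  Y  123.0  closed
# 2  X  Y  999.0    open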