GCP Data Prep- forward and backward fill

I have the following table which I am trying to wrangle in GCP Data prep:
Timestamp Event
2018-04-01 0
2018-04-02 0
2018-04-03 0
2018-04-04 0
2018-04-05 1
2018-04-06 0
2018-04-07 0
2018-04-08 0
I am trying to transform it so that whenever Event is 1, the previous 3 entries in Event are set to 1 and the next 2 entries in Event are set to 2.
So, essentially the data set will look like the below after transformation
Timestamp Event
2018-04-01 0
2018-04-02 1
2018-04-03 1
2018-04-04 1
2018-04-05 1
2018-04-06 2
2018-04-07 2
2018-04-08 0
I have tried to use window functions and conditionals to achieve this, but without success.
Any ideas on how this transformation can be achieved? I am open to splitting the column or creating a new derived column if that can help achieve this result.
Thanks!

You can use window functions inside the conditions of your IF statements. The PREV and NEXT window functions return the value X rows above or below the current row in your window. Once you have those values, you can compare them against the expected value and shape your IF statement accordingly.
For your use case, first check whether the PREV value 1 or 2 rows back is equal to 1; if so, set the row to 2. Otherwise, if the NEXT value 1, 2 or 3 rows ahead is equal to 1, set the row to 1. Lastly, keep 1 where the current row is already 1 and set all remaining rows to 0. Converted into a formula accepted by Dataprep, this looks like the following:
IF(PREV(Event, 1) == 1 || PREV(Event, 2) == 1, 2, IF(NEXT(Event, 1) == 1 || NEXT(Event, 2) == 1 || NEXT(Event, 3) == 1, 1, IF(Event == 1, 1, 0)))
To enter this formula in Dataprep, select “Custom Formula” under the Function tab. In the custom formula window, set the formula type to “Multiple row formula”, as the PREV and NEXT functions require an additional argument specifying which column to sort by.
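If it helps to check the logic outside Dataprep, here is a minimal Python sketch of the same nested IF. The function name `fill_around_events` and its `before`/`after` parameters are made up for illustration, and the rows are assumed to be already sorted by Timestamp. Like PREV and NEXT, it only ever looks at the original Event values.

```python
def fill_around_events(events, before=3, after=2):
    """Mirror the nested IF: 2 for rows just after an event, 1 for rows just before it."""
    n = len(events)
    result = []
    for i, value in enumerate(events):
        # PREV(Event, k) == 1 for some k in 1..after (rows that follow an event)
        prev_hit = any(i - k >= 0 and events[i - k] == 1 for k in range(1, after + 1))
        # NEXT(Event, k) == 1 for some k in 1..before (rows that precede an event)
        next_hit = any(i + k < n and events[i + k] == 1 for k in range(1, before + 1))
        if prev_hit:
            result.append(2)
        elif next_hit:
            result.append(1)
        elif value == 1:
            result.append(1)
        else:
            result.append(0)
    return result


# The Event column from the question, 2018-04-01 through 2018-04-08
print(fill_around_events([0, 0, 0, 0, 1, 0, 0, 0]))
```

Because the "after" check comes first, a row that sits both after one event and before another ends up as 2, which is also what the formula above does.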

Related

Conditional formatting of a rectangle cell range defined by user input

In a Google Sheet with a cell range of 26x26 (so A1:Z26), I need to conditionally format (change the color to green) a rectangle area that is defined by user input.
Example of user input (4 values required):
hsize = 5 / vsize = 4 / starth = 3 / startv = 2
This means that the conditionally formatted area should be a rectangle from C2:G5 because the start cell horizontally is 3 (column C) and vertically 2 (row 2), and the size of the rectangle horizontally is 5 (C,D,E,F,G) and vertically 4 (2,3,4,5).
I already solved this with Apps Script but due to given restrictions I have to implement this without using any scripts.
I have numbered the whole 26x26 area (=sequence(26,26)) to get numbers from 1 to 676 that I could then use for the conditional formatting.
By doing this, I can limit the conditional formatting to the values between the start and the end value (in the example above that would be 29 (C2) and 111 (G5)). This works by using a simple and/if formula in the conditional formatting.
But the problem with this is that all the cells with values from 29 to 111 are now colored, not only the rectangle C2:G5.
I can't figure out how to define a formula that does what I need. How can I do this and limit the highlighted area to the defined cell range of the rectangle?
[Picture here]: green is the conditional formatting from 29 (C2) to 111 (G5), but what I actually need is that only the red-framed area should be shown in green.

Try this, with the four inputs in AB1:AB4 (hsize in AB1, vsize in AB2, starth in AB3, startv in AB4):
=REGEXMATCH(""&A1, "^"&TEXTJOIN("$|^", 1, INDIRECT(
ADDRESS($AB$4, $AB$3)&":"&ADDRESS($AB$2+$AB$4-1, $AB$1+$AB$3-1)))&"$")
or better:
=(COLUMN(A1)>=$AB$3)*(ROW(A1)>=$AB$4)*
(COLUMN(A1)<$AB$1+$AB$3)*(ROW(A1)<$AB$2+$AB$4)
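To see why the second formula picks out only the rectangle while a range check on the sequence numbers does not, here is a small Python sketch of the same test. The helper names `in_rectangle` and `sequence_number` are invented for illustration, and columns and rows are 1-based as in the sheet.

```python
def in_rectangle(col, row, hsize, vsize, starth, startv):
    """Same test as the COLUMN()/ROW() formula: both coordinates must fall inside the box."""
    return (starth <= col < starth + hsize) and (startv <= row < startv + vsize)


def sequence_number(col, row, width=26):
    """Value that =SEQUENCE(26,26) puts in the cell at (col, row)."""
    return (row - 1) * width + col


# Inputs from the question: hsize=5, vsize=4, starth=3, startv=2 (i.e. C2:G5)
cells = [(c, r) for r in range(1, 27) for c in range(1, 27)
         if in_rectangle(c, r, 5, 4, 3, 2)]
print(len(cells))              # number of highlighted cells
print(sequence_number(1, 3))   # A3: lies between 29 and 111 but is outside the rectangle
```

A3 shows the problem with the number-range approach: its sequence value falls between the start and end values, yet its column is left of the rectangle.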

Is there a way to count from last condition x?

I have a complex set of data that can return 3 different conditions per row. I need to be able to count the last x rows matching one of the specific conditions.
The following formula has been working well for me, but I have discovered a glitch in one instance of this formula (the formula is replicated at least a dozen times):
=ArrayFormula(LOOKUP(9.99999999999999E+307,IF(FREQUENCY(IF(AQ:AQ=1,ROW(AQ:AQ)),IF(AQ:AQ<>1,ROW(AQ:AQ)))=0,FREQUENCY(IF(AQ:AQ=1,ROW(AQ:AQ)),IF(AQ:AQ=0,ROW(AQ:AQ))))))
Current criteria are as such:
0: Condition x met - Reset counter
1: Condition y met - Increment counter
2: Condition z met - Ignore this row
Therefore this:
1
2
2
2
1
1
0
1
1
1
Should output: 3
This:
1
2
0
2
2
1
2
1
Should output: 2
However, in the glitch I have encountered, the counter is not reset when a 0 is reached, for example:
1
2
1
2
1
1
2
2
2
2
0
Should output: 0
But in fact is outputting: 4
I have tested all possible conditions with that specific data set and I cannot rectify the issue. I believe there is an error in the formula (specifically the 9.99999999999999E+307) but I wrote it so long ago that I cannot successfully debug it. I have tried 1E+306 but the result is the same.
EDIT1: Upon request, I have included as stripped-down a version of the sheet as I can while still recreating the issue.
https://docs.google.com/spreadsheets/d/1SOXiFMEQelqptBvjcabMZGNgG60TRRbe_b65rzT1bi0/edit?usp=sharing
If you scroll to the bottom of the sheet you can see Col AQ has a 0, as a result the value in the cell AF2 should be 0.
You will notice in the sheet that I am using Named Ranges.
EDIT2: player0's answer was PERFECT!! <3
I modified the new formula to adapt to my spreadsheet so it could accommodate Named Ranges and drop-down lists. This question helped me a lot with that:
Convert column index into corresponding column letter
The final formula (just FYI) turned out to be:
=ARRAYFORMULA(COUNTIF(
INDIRECT(REGEXEXTRACT(ADDRESS(ROW(), column(INDIRECT($A$1 & Z$1 & "L"))), "[A-Z]+")&
MAX(IF((INDIRECT($A$1 & Z$1 & "L")=0)*(INDIRECT($A$1 & Z$1 & "L")<>""),
ROW(INDIRECT($A$1 & Z$1 & "L"))+1,5))&":"&
REGEXEXTRACT(ADDRESS(ROW(), column(INDIRECT($A$1 & Z$1 & "L"))), "[A-Z]+")), 1))

=ARRAYFORMULA(COUNTIF(INDIRECT("A"&
MAX(IF((A2:A=0)*(A2:A<>""), ROW(A2:A)+1, ROW(A2)))&":A"), 1))
spreadsheet demo
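For anyone who wants to verify the expected outputs by hand, the rule from the question (0 resets, 1 increments, 2 is ignored) can be sketched in a few lines of Python. The function name `count_since_last_reset` is made up for illustration. It returns the same result as the formula above, which finds the row after the last 0 and counts the 1s from there down.

```python
def count_since_last_reset(values):
    """Count the 1s that appear after the last 0; 2s are skipped."""
    count = 0
    for v in values:
        if v == 0:
            count = 0      # condition x: reset the counter
        elif v == 1:
            count += 1     # condition y: increment the counter
        # any other value (condition z) is ignored
    return count


# The three data sets from the question
print(count_since_last_reset([1, 2, 2, 2, 1, 1, 0, 1, 1, 1]))
print(count_since_last_reset([1, 2, 0, 2, 2, 1, 2, 1]))
print(count_since_last_reset([1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 0]))
```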

OpenMVG mainComputeMatches.cpp input explanation

How can I compute descriptor matches using a file passed via the --pair_list option (using main_ComputeMatches.cpp)?
What is the format of the data in the file specified by --pair_list?
Thanks in advance for any suggestion.

You must list the image view indices.
See your sfm_data.json: each image is linked to a view index.
In the pair_list file, you just list the image pairs you want to match, for example:
0 1
0 2
1 2
1 6
...

Expanding on @Jack's answer, you could also compact it like this:
0 1 2
1 2 6
...
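Reading the two answers together, the compact form appears to pair the first index on each line with every index that follows it. That reading is inferred from the examples here, not checked against the OpenMVG source. Under that assumption, a small Python sketch (the helper `expand_pair_list` is hypothetical) shows that both listings describe the same pairs:

```python
def expand_pair_list(text):
    """Expand lines like '0 1 2' into explicit pairs (0, 1), (0, 2)."""
    pairs = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or incomplete lines
        first = int(tokens[0])
        for other in tokens[1:]:
            pairs.append((first, int(other)))
    return pairs


explicit = expand_pair_list("0 1\n0 2\n1 2\n1 6\n")
compact = expand_pair_list("0 1 2\n1 2 6\n")
print(explicit == compact)
```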

How do I set different column widths for each column of a tktable?

I have a table made using Python 2.7 and tktable v1.1 that looks like the following:
class GUI(Tkinter.Tk):
    def __init__(self):
        # ... (irrelevant code omitted) ...
        self.testTable = tktable.Table(self, rows=30, cols=30, state='disabled', titlecols=1, titlerows=1,
            selectmode='extended', variable=self.tktableArray, selecttype='row', colstretchmode='unset',
            maxwidth=500, maxheight=190, xscrollcommand=self.HScroll.set, yscrollcommand=self.VScroll.set)  # Create the results table
        self.testTable.grid(column=2, row=6, columnspan=4)  # Add results table to the grid
Irrelevant code was left out in order to not throw a wall of code up. My desire here is to size the column widths independently for each column. For instance in column 0 I have only 3 digit numbers and in column 1 I have a 10 character word. I know that I could use
self.testTable.configure(colwidth=10)
to set the widths of the columns but that applies to all columns. Is there a way to do this on a per-column basis? And even better, is there a way to make the column widths fit to the contents of the column? Any help is appreciated.

I've never used a tktable, but a quick read of the tktable documentation shows there's a width() method on the table object. Have you tried that?
# set width of column 0 to 3, column 1 to 10
self.testTable.width(0,3,1,10)

The right answer is:
columnwidth={'0':7,'1':12,'2':20,'3':35,'4':15,'5':15,'6':22}
self.table.width(**columnwidth)
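On the second part of the question (fitting widths to the contents), one possible approach is to compute the dictionary from the data instead of hard-coding it. This is only a sketch: `widths_from_rows` and its `padding` parameter are made up, and the table contents are assumed to be available as a list of rows of strings. The result has the same shape as `columnwidth` above, so it could be passed the same way, but that call is not tested here.

```python
def widths_from_rows(rows, padding=2):
    """Return {'0': w0, '1': w1, ...} where each width fits the longest cell plus padding."""
    widths = {}
    for row in rows:
        for col, cell in enumerate(row):
            key = str(col)
            needed = len(str(cell)) + padding
            if needed > widths.get(key, 0):
                widths[key] = needed
    return widths


# Hypothetical contents: 3-digit numbers in column 0, 10-character words in column 1
columnwidth = widths_from_rows([["123", "helloworld"], ["7", "abc"]])
print(columnwidth)
# self.table.width(**columnwidth)  # as in the answer above; requires a live tktable widget
```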

Calculating the distance between characters

Problem: I have a large number of scanned documents that are linked to the wrong records in a database. Each image has the correct ID on it somewhere that says where it belongs in the db.
For example, a DB row could be:
| user_id | img_id | img_loc  |
| 1       | 1      | /img.jpg |
img.jpg would have the user_id (1) on the image somewhere.
Method/Solution: Loop through the database. Pull the image text into a variable with OCR and check whether user_id is found anywhere in that variable. If not, flag the record/image in a log; if so, do nothing and move on.
My example is simple. In the real world I have a guarantee that user_id wouldn't accidentally show up on the wrong form (it has a specific format with its own significance).
Right now it is working. However, it is incredibly strict. If you've worked with OCR you understand how fickle it can be. Sometimes a 7 = 1 or a 9 = 7, etc. The result is a large number of false positives, especially among images with low-quality scans.
I've addressed some of the image quality issues with some processing on my side (increasing the image size, adjusting the black/white threshold) and had satisfying results. I'd like to add the ability for the program to recognize, for example, that "81*7*23103" is not very far from "81*9*23103".
The only way I know how to do that is to look at strings at least as long as the one I'm searching for, calculate the distance between each pair of corresponding characters, average those distances, and set a limit on what counts as a good average.
Some examples:
Ex 1
81723103 - Looking for this
81923103 - Found this
--------
00200000 - distances between characters
0 + 0 + 2 + 0 + 0 + 0 + 0 + 0 = 2
2/8 = .25 (pretty good match. 0 = perfect)
Ex 2
81723103 - Looking
81158988 - Found
--------
00635885 - distances
0 + 0 + 6 + 3 + 5 + 8 + 8 + 5 = 35
35/8 = 4.375 (Not a very good match. 9 = worst)
This way I can tell it "Flag the bottom 30% only" and dump anything with an average distance > 6.
I figure I'm reinventing the wheel and wanted to share this for feedback. I see a huge increase in run time and a performance hit doing all these string operations over what I'm currently doing.
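For reference, the scoring described above can be sketched in Python roughly as follows. The function names are invented for illustration, and the sliding-window search over the OCR text is just one assumed way to find candidate strings; it only scores windows made up entirely of digits.

```python
DIGITS = "0123456789"


def digit_distance(expected, found):
    """Average absolute difference between corresponding digits (0 = perfect, 9 = worst)."""
    if len(expected) != len(found):
        raise ValueError("strings must have the same length")
    total = sum(abs(int(a) - int(b)) for a, b in zip(expected, found))
    return total / float(len(expected))


def best_window_distance(expected, ocr_text):
    """Lowest average distance over all digit-only windows of len(expected), or None."""
    n = len(expected)
    best = None
    for start in range(len(ocr_text) - n + 1):
        window = ocr_text[start:start + n]
        if not all(ch in DIGITS for ch in window):
            continue  # skip windows containing non-digit characters
        score = digit_distance(expected, window)
        if best is None or score < best:
            best = score
    return best


print(digit_distance("81723103", "81923103"))   # Ex 1
print(digit_distance("81723103", "81158988"))   # Ex 2
print(best_window_distance("81723103", "ID: 81923103 page 2"))
```

Each window costs one pass over the ID length, so the total work grows with the length of the OCR text times the length of the ID.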
