A question about transforming data in Power BI.
I have a text file with spaces as separators. Some rows (where the day of the month is less than 10) contain a double space before one field; it is always the third field.
Tue May 4 13:57:50 BST 2021: 64 bytes from 8.8.8.8: icmp_seq=12 ttl=119 time=9.22 ms
Tue May 4 13:58:05 BST 2021: 64 bytes from 8.8.8.8: icmp_seq=13 ttl=119 time=10.2 ms
Tue May 4 13:58:20 BST 2021: 64 bytes from 8.8.8.8: icmp_seq=14 ttl=119 time=8.77 ms
Tue May 4 13:58:35 BST 2021: 64 bytes from 8.8.8.8: icmp_seq=15 ttl=119 time=9.69 ms
Tue May 4 13:58:50 BST 2021: 64 bytes from 8.8.8.8: icmp_seq=16 ttl=119 time=9.22 ms
So when I split this file by spaces, some rows are split into 15 columns and some into 16.
I then do a lot of transformations with this file, so I need a way to make the transformation conditional. I couldn't find a solution by myself, so I'd appreciate any advice.
I found a couple of solutions.
In short, the first one uses the standard Split Column function, applied over a few steps:
Split the first three fields by position
Trim the leftover spaces from them
Split the rest by space
The second way to do something similar is to use Python. To insert a step with a Python script into your current transformation you should:
Go to Transform data -> Advanced editor and copy your current steps, because after inserting the Python step you will need a Navigation step, which replaces all existing steps.
Find the step where you want to insert the new Python transformation and click Transform -> Run Python script.
Write the code and save the result (a minimal sketch is shown after these steps)
Click on the dataset name and agree to replace all existing steps
Now paste your previously copied steps back into the Advanced editor
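For the "write code" step, here is a minimal sketch of what the script could contain. Power BI passes the current table to the script as a pandas DataFrame named dataset; the raw text column is assumed here to be called Column1, so adjust that to whatever your query actually uses:

import re

# Collapse every run of spaces into a single space, so that every row
# splits into the same number of fields afterwards
cleaned = dataset["Column1"].map(lambda s: re.sub(r" +", " ", str(s)))

# Power BI picks up any DataFrame defined in the script; navigate into `result`
result = cleaned.str.split(" ", expand=True)

After the script runs, the Navigation step lets you pick the result table and continue with the steps you copied earlier.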
I described the solutions here
https://koftaylov.blogspot.com/2021/05/another-day-building-power-bi-dashboard.html
I'm very new to regex, so question one would be: is this even possible?
I have products that can be in multiple categories/subcategories, but for reporting I just want to attribute them once per top category.
Original data:
1010,1012,1012610,1014243,10147048956,2010,201150205,2011506,2015470
Desired Result:
1010,1012,1014,2010,2011,2015
Details
1010 is unchanged
1012,1012610 reduce to 1 instance of 1012
1014243,10147048956 reduce to 1 instance of 1014
2010 is unchanged
201150205,2011506 reduce to 1 instance of 2011
2015470 is reduced to 2015
My current pattern (?|(10..)|(20..)) works well with the exception of the bold sections in the following:
1010,1012,1012610,1014243,10147048956,2010,201150205,2011506,2015470
As for reducing, I am at a loss for where to start.
Thank you in advance for any assistance or direction.
\b(\w{4})
1010,1012,1012610,1014243,10147048956,2010,201150205,2011506,2015470
After applying the regex \b(\w{4}), you can collect the values in a Set, which will make the elements unique.
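For example, a quick sketch of that idea in Python (any regex flavor with global matching behaves the same way):

import re

data = "1010,1012,1012610,1014243,10147048956,2010,201150205,2011506,2015470"

# \b(\w{4}) captures the first four characters of each comma-separated value
codes = re.findall(r"\b(\w{4})", data)

# dict.fromkeys() removes duplicates while keeping the original order
print(",".join(dict.fromkeys(codes)))   # 1010,1012,1014,2010,2011,2015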
I have a use case where I want to cross-compare two sets of images to find the best matching pairs.
However, the sets are quite big, and for performance purposes I don't want to open and close images all the time.
So my idea is:
#include <Magick++.h>
#include <map>

// Load all of set 1 once and keep it in memory
std::map<int, Magick::Image> set1;
for (...) { set1[...] = Magick::Image(...); }

// For each image of set 2, compare against every image of set 1
std::map<int, int> best;
for (...) {
    Magick::Image img2 = Magick::Image(...);
    // Compare with all of set1
    ...
    best[...] = ...;  // key of the best-matching set1 image
}
Obviously I don't need to store all of set 2, since I work image by image.
But in any case set 1 is already so big that storing 32-bit images is too much. For reference: 15,000 images at 300x300 is about 5 GB.
I thought about reducing the memory by downsampling the images to monochrome (it does not affect my use case). But how do I do that? Even if I extract a single color channel, ImageMagick still treats the new image as 32 bits, even though it is just one channel.
My final approach has been to write my own parser that reads the image color by color, converts it, and builds a bit vector; then I XOR the vectors and count bits. That works (using only 170 MB).
However, it is not flexible. What if I want to use 2 bits, or 8 bits, at some point? Is there any way to do this with ImageMagick's own classes and just call compare()?
Thanks!
I have a couple of suggestions - maybe something will give you an idea!
Suggestion 1
Maybe you could use a Perceptual Hash. Rather than holding all your images in memory, you calculate a hash one at a time for each image and then compare the distance between the hashes.
Some pHASHes are invariant to image scale (or you can scale all images to the same size before hashing) and most are invariant to image format.
Here is an article by Dr Neal Krawetz... Perceptual Hashing.
ImageMagick can also do Perceptual Hashing and is callable from PHP - see here.
I also wrote some code some time back for this sort of thing... code.
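To make the idea concrete, here is a minimal sketch in Python using the Pillow and imagehash packages (the file names are placeholders; any pHash implementation, including ImageMagick's, follows the same pattern):

from PIL import Image
import imagehash

# Hash each image once; a 64-bit pHash is far cheaper to keep around than the pixels
h1 = imagehash.phash(Image.open("set1/a.png"))
h2 = imagehash.phash(Image.open("set2/1.png"))

# Subtracting two hashes gives their Hamming distance: small distance = similar images
print(h1 - h2)

You would hash the 15,000 images of set 1 once (well under a megabyte of hashes in total) and then compare each incoming set 2 hash against all of them.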
Suggestion 2
I understand that ImageMagick Version 7 is imminent - no idea who could tell you more - and that it supports true single-channel, grayscale images - as well as up to 32 channel multi-spectral images. I believe it can also act as a server - holding images in memory for subsequent use. Maybe that can help.
Suggestion 3
Maybe you can get some mileage out of GNU Parallel - it can keep all your CPU cores busy in parallel and also distribute work across a number of servers using ssh. There are plenty of tutorials and examples out there, but just to demonstrate comparing each item of a named set of images (a,b,c,d) with each of a numbered set of images (1,2), you could do this:
parallel -k echo {#} compare {1} {2} ::: a b c d ::: 1 2
Output
1 compare a 1
2 compare a 2
3 compare b 1
4 compare b 2
5 compare c 1
6 compare c 2
7 compare d 1
8 compare d 2
Obviously I have put echo in there so you can see the commands generated, but you can remove that and actually run compare.
So, your code might look more like this:
#!/bin/bash
# Create a bash function that GNU Parallel can call to compare two images
comparethem() {
   result=$(convert -metric rmse "$1" "$2" -compare -format "%[distortion]" info:)
   echo Job:$3 $1 vs $2 $result
}
export -f comparethem
# Next line effectively uses all cores in parallel to compare pairs of images
parallel comparethem {1} {2} {#} ::: set1/*.png ::: set2/*.png
Output
Job:3 set1/s1i1.png vs set2/s2i3.png 0.410088
Job:4 set1/s1i1.png vs set2/s2i4.png 0.408234
Job:6 set1/s1i2.png vs set2/s2i2.png 0.406902
Job:7 set1/s1i2.png vs set2/s2i3.png 0.408173
Job:8 set1/s1i2.png vs set2/s2i4.png 0.407242
Job:5 set1/s1i2.png vs set2/s2i1.png 0.408123
Job:2 set1/s1i1.png vs set2/s2i2.png 0.408835
Job:1 set1/s1i1.png vs set2/s2i1.png 0.408979
Job:9 set1/s1i3.png vs set2/s2i1.png 0.409011
Job:10 set1/s1i3.png vs set2/s2i2.png 0.407391
Job:11 set1/s1i3.png vs set2/s2i3.png 0.408614
Job:12 set1/s1i3.png vs set2/s2i4.png 0.408228
Suggestion 4
I wrote an answer a while back about using REDIS to cache images - that can also work in a distributed fashion amongst a small pool of servers. That answer is here.
Suggestion 5
You may find that you can get better performance by converting the second set of images to Magick Pixel Cache format so that they can be DMA'ed into memory rather than needing to be decoded and decompressed each time. So you would do this:
convert image.png image.mpc
which gives you these two files which ImageMagick can read really quickly.
-rw-r--r-- 1 mark staff 856 16 Jan 12:13 image.mpc
-rw------- 1 mark staff 80000 16 Jan 12:13 image.cache
Note that I am not suggesting you permanently store your images in MPC format as it is unique to ImageMagick and can change between releases. I am suggesting you generate a copy in that format just before you do your analysis runs each time.
(Stata/MP 13.1)
I am working with a set of massive datasets that take an extremely long time to load. I currently loop through all the datasets to load them each time.
Is it possible to tell Stata to load just the first 5 observations of each dataset (or, in general, the first n observations in each use command) without actually having to load the entire dataset? Otherwise, if I load the entire dataset and then keep only the first 5 observations, the process takes an extremely long time.
Here are two work-arounds I have already tried
use in 1/5 using mydata: I think this is more efficient than loading the data and then keeping the observations you want in a separate line, but I think it still reads in the entire dataset.
First load all the data sets, then save copies of all the data sets to just be the first 5 observations, and then just use the copies: This is cumbersome as I have a lot of different files; I would very much prefer just a direct way to read in the first 5 observations without having to resort to this method and without having to read the entire data set.
I'd say using in is the natural way to do this in Stata, but testing shows you are correct: it really makes no "big" difference, given the size of the dataset. An example (with 148,000,000 observations):
sysuse auto, clear
expand 2000000
tempfile bigfile
save "`bigfile'", replace
clear
timer on 1
use "`bigfile'"
timer off 1
clear
timer on 2
use "`bigfile'" in 1/5
timer off 2
timer list
timer clear
Resulting in
. timer list
1: 6.44 / 1 = 6.4400
2: 4.85 / 1 = 4.8480
I find that surprising since in seems really efficient in other contexts.
I would contact Stata Tech Support (and/or search around, including www.statalist.com) just to ask why in isn't much faster (independently of whether you find some other strategy to handle this problem). It's worth using, of course, but it is not fast enough for many applications.
In terms of workflow, your second option might be the best. Leave the computer running while the smaller datasets are created (use a for loop), and get back to your regular coding/debugging once that's finished. This really depends on what you're doing, so it may or may not work.
Actually, I found the solution. If you run
use mybigdata if runiform() <= 0.0001
Stata will take a random sample of 0.0001 of the data set without reading the entire data set.
Thanks!
Vincent
Edit: 4/28/2015 (1:58 PM EST)
My apologies. It turns out the above was actually not a solution to the original question. It seems that on my system there was high variability in the speed of using
use mybigdata if runiform() <= 0.0001
each time I ran it. When I posted that the above was a solution, the run probably just happened to be one of the faster instances. However, now that I am repeatedly running
use mybigdata if runiform() <= 0.0001
vs.
use in 1/5 using mydata
I am actually finding that
use in 1/5 using mydata
is on average faster.
In general, my question is simply how to read in a portion of a Stata data set without having to read in the entire data set for computational purposes especially when the data set is really large.
Edit: 4/28/2015 (2:50 PM EST)
In total, I have 20 datasets, each with between 5 and 15 million observations. I only need to keep 8 of the variables (there are 58-65 variables in each dataset). Below is the output from the first four "describe, short" statements.
2004 action1
Contains data from 2004action1.dta
obs: 15,039,576
vars: 64 30 Oct 2014 17:09
size: 2,827,440,288
Sorted by:
2004 action2578
Contains data from 2004action2578.dta
obs: 13,449,087
vars: 59 30 Oct 2014 17:16
size: 2,098,057,572
Sorted by:
2005 action1
Contains data from 2005action1.dta
obs: 15,638,296
vars: 65 30 Oct 2014 16:47
size: 3,143,297,496
Sorted by:
2005 action2578
Contains data from 2005action2578.dta
obs: 14,951,428
vars: 59 30 Oct 2014 17:03
size: 2,362,325,624
Sorted by:
Thanks!
Vincent
I came across a problem for which I could not find an elegant solution...
We have an application that monitors audio-input and tries to assign matches based on acoustic fingerprints.
The application gets a sample every few seconds, then does a lookup and stores the timestamped result in the database.
The fingerprinting is not always accurate, so it happens that "wrong" items get assigned. So the data looks something like:
timestamp foreign_id my comment
--------------------------------------------------
12:00:00 17
12:00:10 17
12:00:20 17
12:00:30 17
12:00:40 723 wrong match
12:00:50 17
12:01:00 17
12:01:10 17
12:01:20 None no match
12:01:30 17
12:01:40 18
12:01:50 18
12:02:00 18
12:02:10 18
12:02:20 18
12:02:30 992 wrong match
12:02:40 18
12:02:50 18
So I'm looking for a way to "clean up" the data periodically.
Can anyone think of a nice way to achieve this? In the given example, the entry with the foreign id of 723 should be corrected to 17, and so on. And, if possible, with a threshold for how many entries back and forth should be taken into account.
I'm not sure if my question is clear enough this way, but any input is welcome!
Check that a foreign id is in the database so many times, then check if those times are close together?
Why not just disregard the 'bad' data when using the data?
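If you do want to rewrite the stored values rather than just ignore them, one way to express the "how many entries back and forth" threshold is a sliding-window majority vote over the foreign ids. A minimal sketch in Python (the window size and the plain-list input are assumptions; in practice you would read and update the rows in your database):

from collections import Counter

def clean(ids, window=2):
    # Replace each id with the most common id among its `window`
    # neighbours on either side, ignoring rows with no match
    cleaned = []
    for i in range(len(ids)):
        votes = Counter(x for x in ids[max(0, i - window):i + window + 1]
                        if x is not None)
        cleaned.append(votes.most_common(1)[0][0] if votes else None)
    return cleaned

ids = [17, 17, 17, 17, 723, 17, 17, 17, None, 17, 18, 18, 18, 18, 18, 992, 18, 18]
print(clean(ids))   # the 723 and 992 outliers are voted back to 17 and 18

Note that this also fills the "no match" gap from its neighbours; if you would rather leave those rows untouched, skip them in the loop.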
What's the best way to write a dataset to a file that is frequently changing?
i.e. a 12 MB dataset that has 4 kB segments that change every 2 seconds. Rewriting the entire 12 MB seems like a waste.
Is there any way to do this using C/C++?
Yes, you can write at a particular offset in a file. In C it is done with fseek, and C++ streams offer the equivalent (seekp), so you can do the same there.
See http://www.cplusplus.com/reference/clibrary/cstdio/fseek/ for an example