I would like to make a macro in Excel, but I think it's too complicated to do it with recording... That's why I'm coming here for assistance.
The file:
I have a list of warehouse boxes, each with a specific ID, a location (town), a location (in or out) and a date.
Whenever boxes change location, this needs to be changed in this list and the date should be adjusted accordingly (this should be a manual input, since the list might not be updated on the same day the box is moved).
On top of that, I need to count the number of times the location changes from in to out (so that I know how many times the box has been used).
The way of inputting:
A good way of inputting would be to make a list of the boxes whose information you want to change, e.g.:
ID | Location (town) | Location (in/out) | Date
------------------------------------------------
123-4 | Paris | OUT | 9-1-14
124-8 | London | IN | 9-1-14
999-84| London | IN | 10-1-14
124-8 | New York | OUT | 9-1-14
Then I'd make a button that runs a macro to change the data mentioned above in the master list (where all the boxes are listed) and in some way count the number of times OUT changes to IN, etc.
Is this possible?
I'm not entirely sure what you want updated in your Main List, but I don't think you need macros at all to achieve this. You can count the number of times a box's location has changed by simply making a list of all your boxes in one column and the count in the next column. For the count, use the COUNTIFS formula to count all the rows where the box ID matches and the location is IN (or OUT). Check VLOOKUP for updating your Main List values.
To perform geoqueries in DynamoDB, there are libraries in AWS (https://aws.amazon.com/blogs/mobile/geo-library-for-amazon-dynamodb-part-1-table-structure/). But to sort the results of a geoquery by distance, the entire dataset must be read, correct? If a geoquery produces a large number of results, there is no way to paginate that (on the backend, not to the user) if you're sorting by distance, is there?
You are correct. To sort all of the data points by distance from some arbitrary location, you must read all the data from your DynamoDB table.
In DynamoDB, you can only sort results using a pre-computed value that has been stored in the DynamoDB table and is being used as the sort key of the table or one of its indexes. If you need to sort by distance from a fixed location, then you can do this with DynamoDB by precomputing each item's distance from that location and storing it as the sort key.
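For example (a minimal boto3 sketch, assuming a hypothetical Points table whose ByDistance index uses a precomputed distance_km attribute as its sort key; the table, index, and key names are all made up for illustration and do not come from the Geo library):

import boto3
from boto3.dynamodb.conditions import Key

# Each item stores distance_km, computed at write time from one fixed
# reference point, and that attribute is the sort key of the ByDistance GSI.
table = boto3.resource("dynamodb").Table("Points")

resp = table.query(
    IndexName="ByDistance",
    KeyConditionExpression=Key("region").eq("paris") & Key("distance_km").lt(400),
    ScanIndexForward=True,  # results come back ordered by the precomputed distance
)
items = resp["Items"]

The point is that distance_km was written ahead of time; DynamoDB never computes a distance for you at query time.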
Possible Workaround (with limitations)
TL;DR: it's not such a bad problem if you can get away with only sorting the items that are within X km of an arbitrary point.
This still involves sorting the data points in memory, but it makes the problem easier by accepting incomplete results (limiting the maximum range of the results).
To do this, you need the Geohash of your point P (from which you are measuring the distance of all other points). Suppose it is A234311. Then you need to pick what range of results is appropriate. Let's put some numbers on this to make it concrete. (I'm totally making these numbers up because the actual numbers are irrelevant for understanding the concepts.)
A - represents a 6400km by 6400km area
2 - represents a 3200km by 3200km area within A
3 - represents a 1600km by 1600km area within A2
4 - represents a 800km by 800km area within A23
3 - represents a 400km by 400km area within A234
1 - represents a 200km by 200km area within A2343
1 - represents a 100km by 100km area within A23431
Graphically, it might look like this:
View of A View of A23
|----------|-----------| |----------|-----------|
| | A21 | A22 | | | |
| A1 |-----|-----| | A231 | A232 |
| | A23 | A24 | | | |
|----------|-----------| |----------|-----------|
| | | | |A2341|A2342|
| A3 | A4 | | A233 |-----|-----|
| | | | |A2343|A2344|
|----------|-----------| |----------|-----------| ... and so on.
In this case, our point P is in A234311. Suppose also that we want to get the sorted points within 400km. A2343 is 400km by 400km, so we need to load the results from A2343 and all of its 8-connected neighbors (A2341, A2342, A2344, A2334, A2332, A4112, A4121, A4122). Once we've loaded only those into memory, we calculate the distances, sort them, and discard any results that are more than 400km away.
(You could keep the results that are more than 400km away as long as the users/clients know that beyond 400km, the data could be incomplete.)
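A rough Python sketch of that load-then-sort step (the cell list and the DynamoDB read are passed in as caller-supplied pieces, since those come from whatever geohash/Geo-library plumbing you already have; the names here are assumptions for illustration):

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def sorted_points_within(p_lat, p_lon, max_km, cells, load_cell):
    # cells: the centre cell plus its 8-connected neighbours (A2343 and friends);
    # load_cell: a callable that reads one cell's items from DynamoDB.
    # Both are assumptions standing in for your geo library / data access code.
    candidates = []
    for cell in cells:
        candidates.extend(load_cell(cell))
    # Compute exact distances in memory, drop anything beyond max_km, then sort.
    scored = [(haversine_km(p_lat, p_lon, it["lat"], it["lon"]), it) for it in candidates]
    scored = [(d, it) for d, it in scored if d <= max_km]
    scored.sort(key=lambda pair: pair[0])
    return scored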
The hashing method that DynamoDB Geo library uses is very similar to a Z-Order Curve—you may find it helpful to familiarize yourself with that method as well as Part 1 and Part 2 of the AWS Database Blog on Z-Order Indexing for Multifaceted Queries in DynamoDB.
Not exactly. When querying by location you query by a fixed partition key value and by a sort key condition, so you can limit your result set and also apply a little filtering.
I have been racking my brain while designing a DynamoDB Geo Hash proximity locator service. For this example customer_A wants to find all service providers_X in their area. All customers and providers have a 'g8' key that stores their precise geoHash location (to 8 levels).
The accepted way to accomplish this search is to generate a secondary index from the main table with a less accurate geoHash 'g4', which gives a broader area for the main query key. I am applying key overloading and composite key structures for a single-table design. The goal in this design is to return all the data required in a single query; secondary indexes can duplicate data by design (storage is cheap, but CPU and bandwidth are not).
GSI1PK GSI1SK providerId Projected keys and attributes
---------------------------------------------
g4_9q5c provider pr_providerId1 name rating
g4_9q5c provider pr_providerId2 name rating
g4_9q5h provider pr_providerId3 name rating
Scenario 1: customer_A.g8_9q5cfmtk. So you issue a query where GSI1PK=g4_9q5c, and a list of two providers is returned, not the three I desire.
But using geoHash.neighbor() will return the eight surrounding neighbors, like 9q5h (see reference below). That's great, because there is a provider in 9q5h, but it means I have to run nine queries, one on the center and eight on the neighbors, or run 1-N of them until I have the minimum results I require.
But which direction do you query second: NW, SW, E? This would require another level of hinting toward which neighbor has more results, which you can't know in advance unless you run a pre-query for weighted results. But then you run the risk of only returning favorable neighbors, as there could be new providers in previously unfavored neighbors. You could apply some ML and randomized queries into neighbors to check current counts.
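In boto3 terms, the nine-query version of scenario 1 looks roughly like this (the Providers table name and GSI1 index name are assumptions; the neighbour list is whatever geoHash.neighbor() gives you):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Providers")  # hypothetical table name

def providers_near(g4_center, g4_neighbors):
    # Query the centre g4 cell plus its 8 neighbours on the overloaded GSI.
    providers = []
    for cell in [g4_center] + list(g4_neighbors):
        resp = table.query(
            IndexName="GSI1",
            KeyConditionExpression=Key("GSI1PK").eq(f"g4_{cell}")
            & Key("GSI1SK").eq("provider"),
        )
        providers.extend(resp["Items"])
    return providers

# e.g. providers_near("9q5c", geoHash.neighbor("9q5c")) issues up to nine queries,
# or you can stop early once you have the minimum number of results you need.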
Before the above approach I tried this design.
GSI1PK GSI1SK providerId Projected keys and attributes
---------------------------------------------
loc g8_9q5cfmtk pr_provider1
loc g8_9q5cfjgq pr_provider2
loc g8_9q5fe954 pr_provider3
Scenario 2: customer_A.g8_9q5cfmtk. So you issue a query where GSI1PK=loc and GSI1SK is between g8_9q5ca and g8_9q5fz, and a list of three providers is returned, but a ton of data was pulled and discarded.
To build that BETWEEN X AND Y sort criterion: 9q5c.neighbors().sorted() = 9q59, 9q5c, 9q5d, 9q5e, 9q5f, 9q5g, 9qh1, 9qh4, 9qh5. So we can just use X=9q59 and Y=9qh5, but there are over 50 (I really didn't count after 50) matching quadrants in such a string-ordered BETWEEN range.
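In boto3 terms that range query for scenario 2 is roughly the following (same assumed table and index names as above); the client-side filter against the nine wanted prefixes is where all the discarded data shows up:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Providers")  # hypothetical name, as above

# The nine sorted neighbour prefixes from 9q5c.neighbors().sorted()
wanted_cells = {"9q59", "9q5c", "9q5d", "9q5e", "9q5f", "9q5g", "9qh1", "9qh4", "9qh5"}

resp = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("loc")
    & Key("GSI1SK").between("g8_9q59", "g8_9qh5"),
)
# Everything in the lexicographic range comes back, including quadrants that were
# never wanted, so a lot of it is dropped client-side.
nearby = [i for i in resp["Items"] if i["GSI1SK"][3:7] in wanted_cells]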
Regarding the hash/size table above, I would recommend using this reference: https://www.movable-type.co.uk/scripts/geohash.html
Geohash length | Cell width | Cell height
----------------------------------------
1 | ≤ 5,000km | 5,000km
2 | ≤ 1,250km | 625km
3 | ≤ 156km | 156km
4 | ≤ 39.1km | 19.5km
5 | ≤ 4.89km | 4.89km
...
We have a completely serverless architecture and have been using DynamoDB almost since it was released, but I am struggling to see how to deal with tabulating global numbers at a large scale. Say we have users who choose to do either A or B. We want to keep track of how many users do each, and these events can happen at a high rate. According to DynamoDB best practices, you are not supposed to write continually to a single row. What is the best way to handle this without using another service like CouchDB or ElastiCache?
You could bucket your users by the first letter of their usernames (or something similar) as the partition key, and either A or B as the sort key, with a regular attribute holding the count.
For example:
PARTITION KEY | SORT KEY | COUNT
--------------------------------
a | A | 5
a | B | 7
b | B | 15
c | A | 1
c | B | 3
The advantage is that you can reduce the risk of hot partitions by spreading your writes across multiple partitions.
Of course, you're trading hot writes for more expensive reads, since now you'll have to scan + filter(A) to get the total count that chose A, and another scan + filter(B) for the total count of B. But if you're writing a bunch and only reading on rare occasions, this may be ok.
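A minimal boto3 sketch of that layout, assuming a table named Counters with bucket as the partition key and choice as the sort key (all names are made up for illustration):

import boto3

table = boto3.resource("dynamodb").Table("Counters")  # hypothetical table name

def record_choice(username, choice):
    # Spread writes across partitions by bucketing on the first letter.
    table.update_item(
        Key={"bucket": username[0].lower(), "choice": choice},
        UpdateExpression="ADD #c :one",
        ExpressionAttributeNames={"#c": "count"},  # COUNT is a reserved word
        ExpressionAttributeValues={":one": 1},
    )

def total_for(choice):
    # Read side: scan the (small) counter table and sum the buckets.
    total, resp = 0, table.scan()
    while True:
        total += sum(i["count"] for i in resp["Items"] if i["choice"] == choice)
        if "LastEvaluatedKey" not in resp:
            return int(total)
        resp = table.scan(ExclusiveStartKey=resp["LastEvaluatedKey"])

With first-letter buckets and two choices the counter table tops out at a few dozen rows, so the scan on the read side stays cheap even though it touches every bucket.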
I have a 10,000 observation dataset with a list of location information looking like this:
ADDRESS | CITY | STATE | ZIP |LATITUDE |LONGITUDE
1189 Beall Ave | Wooster | OH | 44691 | 40.8110501 |-81.93361870000001
580 West 113th Street | New York City | NY | 10025 | 40.8059768 | -73.96506139999997
268 West Putnam Avenue | Greenwich | CT | 06830 | 40.81776801 |-73.96324589997
1 University Drive | Orange | CA | 92866 | 40.843766801 |-73.9447589997
200 South Pointe Drive | Miami Beach | FL | 33139 | 40.1234801 |-73.966427997
I need to find the overlapping locations within a 5 mile and a 10 mile radius. I heard that there is a function called geodist which may allow me to do that, although I have never used it. The problem is that for geodist to work I may need all the combinations of latitudes and longitudes to be side by side, which would make the file really large and hard to use. I also do not know how I would get the lat/longs for every combination to be side by side.
Does anyone know of a way I could get the final output that I am looking for ?
Here is a broad outline of one possible approach to this problem:
Allocate each address into a latitude and longitude 'grid' by rounding the co-ordinates to the nearest 0.01 degrees or something like that.
Within each cell, number all the addresses 1 to n so that each has a unique id.
Write a datastep taking your address dataset as input via a set statement, and also load it into a hash object. Your dataset is fairly small, so you should have no problems fitting the relevant bits in memory.
For each address, calculate distances only to other addresses in the same cell, or other cells within a certain radius, i.e.
Decide which cell to look up
Iterate through all the addresses in that cell using the unique id you created earlier, looking up the co-ordinates of each from the hash object
Use geodist to calculate the distance for each one and output a record if it's a keeper.
This is a bit more work to program, but it is much more efficient than an O(n^2) brute force search. I once used a similar algorithm with a dataset of 1.8m UK postcodes and about 60m points of co-ordinate data.
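Not SAS, but here is a small Python sketch of the same grid idea, in case it helps to see the shape of the algorithm (a haversine function stands in for geodist, and the 0.25-degree cell size is just an assumption that comfortably covers a 10-mile radius at these latitudes):

import math
from collections import defaultdict

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles (stand-in for SAS geodist).
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 3958.8 * math.asin(math.sqrt(a))

def pairs_within(points, radius_miles, cell_deg=0.25):
    # points: iterable of (id, lat, lon). Yields (id_a, id_b, distance) pairs no
    # more than radius_miles apart. cell_deg must span at least radius_miles of
    # longitude at your latitudes (0.25 degrees is roughly 13 miles at 40N).
    grid = defaultdict(list)
    for pid, lat, lon in points:
        grid[(int(lat // cell_deg), int(lon // cell_deg))].append((pid, lat, lon))
    for (gx, gy), bucket in grid.items():
        for a, alat, alon in bucket:
            # Only compare against points in this cell and the 8 surrounding cells.
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    for b, blat, blon in grid.get((gx + dx, gy + dy), []):
                        if a >= b:  # report each unordered pair once
                            continue
                        d = haversine_miles(alat, alon, blat, blon)
                        if d <= radius_miles:
                            yield a, b, d

In SAS the same bucketing works with a hash object keyed on the rounded grid cell, exactly as in the outline above.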
I have a url report that gets generated on a running weekly basis. Each week the report generates a new worksheet within a workbook that keeps around 6 months worth of data at a time. I want to find and pull the data on a specific url from the worksheets and display them in a new worksheet.
For example data in a worksheet might look like:
Week of Mar 9
URL | Visits | Conversions
mysite.com/apple | 300 | 10
mysite.com/banana | 100 | 20
mysite.com/pear | 600 | 5
And each worksheet in the workbook is a different week, such as Mar 2, Feb 23, etc.
Now, I want every Apple url in one worksheet so that I can compare...Apples to Apples...(pun intended). Since there are hundreds of these I can't afford the time to manually do this for each segment I need, so I tried the following.
=INDEX('312015'!8:999,MATCH("apple",'312015'!8:999,-1))
I'm uncertain which match type to use for MATCH, other than that 0 is "exact match" from what I read online, so I tried both 1 and -1 to get a non-exact match, though in reality I probably need a partial match since "apple" is only part of the URL.
Any suggestions on how to get this to work or a better way to do this in excel would be great. Also, I can not manipulate the report output themselves as it comes from a third party vendor and I've already asked them about adjusting this.
I thought about using vlookup as well, but I believe that only returns the first result with that value and not multiple ones.
Assuming URL in ColumnA, Visits in ColumnB, and Conversions in ColumnC for your source data, and in another sheet your page (fruit: apple/banana/pear) in A1, Visits in B1, Conversions in C1, and sheet names in A2 downwards, then I suggest in B2:
=INDEX(INDIRECT($A2&"!B:B"),MATCH("*"&$A$1,INDIRECT($A2&"!A:A"),0))
in C2:
=INDEX(INDIRECT($A2&"!C:C"),MATCH("*"&$A$1,INDIRECT($A2&"!A:A"),0))
and the two formulae copied down to suit.
This looks for an exact match, but does so with a wildcard, so only part of the URL needs to match.
Say we have about 1e10 lines of log file every day, each one containing: an ID number (an integer of at most 15 digits), a login time, and a logout time. Some IDs may log in and out several times.
Question 1:
How do we count the total number of distinct IDs that have logged in? (Each ID should be counted only once.)
I tried to use a hash table here, but I found that the memory required may be too large.
Question 2:
Calculate the time at which the population of online users is largest.
I think we could split the day into 86,400 seconds, then for each line of the log file add 1 to each second in the online interval. Or maybe I could sort the log file by login time?
You can do this in a *nix shell:
cut -f1 logname.log | sort | uniq | wc -l
cut -f2 logname.log | sort | uniq -c | sort -r
For question 2 to make sense, you probably have to log two things: when a user logs in and when a user logs out, i.e. two different events along with the user ID. If this list is sorted by the time at which the activity (either the login or the logout) occurred, you just scan it with a counter called currentusers: add 1 for each login and subtract 1 for each logout. The maximum that number (current users) reaches is the value you're interested in; you will probably also want to track the time at which it occurred.
For question 1, forget C++ and use *nix tools. Assuming the log file is space delimited, then the number of unique logins in a given log is computed by:
$ awk '{print $1}' foo.log | sort | uniq | wc -l
GNU sort will happily sort files larger than memory. Here's what each piece is doing:
awk is extracting the first space-delimited column (the ID number).
sort is sorting those ID numbers, because uniq needs sorted input.
uniq is returning only unique numbers.
wc prints the number of lines, which will be the number of unique numbers.
Use a segment tree to store intervals of consecutive IDs.
Scan the logs for all the login events.
To insert an ID, first search for a segment containing it: if one exists, the ID is a duplicate. If it doesn't exist, search for the segments immediately before or after the ID. If they exist, remove them and merge them with the new ID as needed, then insert the new segment. If they don't exist, insert the ID as a segment of one element.
Once all ids have been inserted, count their number by summing the cardinals of all the segments in the tree.
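A minimal sketch of that idea, using a sorted list of disjoint ID intervals rather than an actual tree (the containment and merge logic is the same):

import bisect

class IntervalSet:
    # Disjoint, sorted closed intervals [start, end] of integer IDs.
    def __init__(self):
        self.starts = []  # sorted interval start points
        self.ends = []    # parallel list of interval end points

    def insert(self, x):
        i = bisect.bisect_right(self.starts, x) - 1
        if i >= 0 and self.starts[i] <= x <= self.ends[i]:
            return  # duplicate ID, already covered
        merge_left = i >= 0 and self.ends[i] == x - 1
        merge_right = i + 1 < len(self.starts) and self.starts[i + 1] == x + 1
        if merge_left and merge_right:
            self.ends[i] = self.ends[i + 1]    # bridge the two neighbouring segments
            del self.starts[i + 1], self.ends[i + 1]
        elif merge_left:
            self.ends[i] = x                   # extend the segment on the left
        elif merge_right:
            self.starts[i + 1] = x             # extend the segment on the right
        else:
            self.starts.insert(i + 1, x)       # brand new one-element segment
            self.ends.insert(i + 1, x)

    def count(self):
        return sum(e - s + 1 for s, e in zip(self.starts, self.ends))

Because runs of consecutive IDs collapse into single (start, end) pairs, the memory used depends on the number of runs, not on the number of log lines.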
Assuming that:
a given ID may be logged in only once at any given time,
events are stored in chronological order (as logs normally are):
Scan the log and keep a counter c of the number of currently logged-in users, as well as the maximum number m found and the associated time t. For each login increment c, and for each logout decrement it. At each step, update m and t if m is lower than c.
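A minimal sketch of that single pass, assuming the log has already been parsed into (timestamp, delta) tuples sorted by time, with delta = +1 for a login and -1 for a logout:

def peak_concurrency(events):
    # events: iterable of (timestamp, delta), already in chronological order.
    c = 0           # users currently logged in
    m, t = 0, None  # best count seen so far, and when it happened
    for ts, delta in events:
        c += delta
        if c > m:
            m, t = c, ts
    return m, t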
For 1, you can try working with fragments of the file at a time that are small enough to fit into memory.
i.e. instead of
countUnique([1, 2, ... 1000000])
try
countUnique([1, 2, ... 1000]) +
countUnique([1001, 1002, ... 2000]) +
countUnique([2001, 2002, ...]) + ... + countUnique([999000, 999001, ... 1000000])
Question 2 is a bit more tricky. Partitioning the work into manageable intervals (a second, as you suggested) is a good idea. For each second, find the number of people logged in during that second by using the following check (pseudocode):
def loggedIn(loginTime, logoutTime, currentTimeInterval):
    # True when the user's session covers this second of the day
    return loginTime <= currentTimeInterval < logoutTime
Apply loggedIn to all 86,400 seconds, then take the maximum of the 86,400 user counts to find the time at which the population of online users is the largest.
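Putting those two pieces together, a compact sketch, assuming sessions are (loginTime, logoutTime) pairs in seconds of the day and loggedIn is the check above (note this brute-force version performs 86,400 × N checks, so for 1e10 lines the single-pass event sweep from the earlier answer is far cheaper):

def busiest_second(sessions):
    counts = [
        sum(1 for login, logout in sessions if loggedIn(login, logout, t))
        for t in range(86400)
    ]
    peak = max(range(86400), key=counts.__getitem__)
    return peak, counts[peak]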