Given some short integers and the dates they represent, is there any systematic method to determine how they're stored in this format and decode other dates? The data stored is from another piece of software.
I initially thought that the days were represented by one of the bytes, since the first byte for May 1 minus the first byte for Feb 11 did equal the correct number of days (79 for year 2011). But it can't be that simple, not only because 8 bits can only store 256 days, but also because dates before 2000 store the year only, with both bytes.
Here's what I'm working with, but take the column headings with a grain of salt.
DDDDDDDD YYYYYYYY DD-MM-YY
01011010 10001010 1955
10110010 10010001 1960
11000000 10010001 1961
11100011 11000011 1996
01010001 11000110 1997
00001101 11001000 1999
10000000 11001001 10-02-00
11010101 11001010 16-01-01
10101010 11001101 11-01-03
00000101 11010000 05-09-04
10011101 11010101 07-08-08
11010000 11010101 27-09-08
00010000 11010110 30-11-08
00110100 11010110 05-01-09
11111110 11010110 26-07-09
10011101 11010111 01-01-10
10110111 11011000 10-10-10
00110011 11011001 11-02-11
00111010 11011001 18-02-11
10000010 11011001 01-05-11
10000101 11011001 04-05-11
01101100 11100110 19-05-20
I also see that 30-11-08 has the same second byte as 05-01-09, and conversely the two dates in 2010 have different values in the second byte.
EDIT: Thanks to the answers and some research, I see that the epoch is November 17, 1858. This is a standard format called the Modified Julian Day.
It looks like it's days since some point in the past, ~1858 (I haven't worked out all the leap-year magic), but the full day-month date is only displayed in your existing app for years >= 2000. The byte you marked "year" is the high-order byte, while the "day" byte is the low-order byte.
Isn't it just a 16-bit days-since-epoch value? Feb 10 2000 is 51584; Feb 18 2011 is 55610. There are 4026 days between them: (11 * 365) + 3 leap days + the 8-day difference in day of month. The start of the epoch would appear to be roughly 1860. Or, more likely, the high-order 1 bit turns on Jan 1, 1950.
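If so, decoding is tiny. A minimal sketch in Python, assuming a little-endian 16-bit Modified Julian Day count (per the OP's edit); the test row is 18-02-11 from the table:
from datetime import date, timedelta
MJD_EPOCH = date(1858, 11, 17)  # Modified Julian Day epoch
def decode(low_byte, high_byte):
    # 16-bit little-endian count of days since the MJD epoch
    return MJD_EPOCH + timedelta(days=(high_byte << 8) | low_byte)
print(decode(0b00111010, 0b11011001))  # row "18-02-11": prints 2011-02-18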
I have gathered satellite data (every 5 minutes, from "Solcast") for GHI, DNI and DHI and I use pvlib to get the POA value.
The pvlib function I use:
import pandas as pd
from pvlib import irradiance

def get_irradiance(site_location, date, tilt, surface_azimuth, ghi, dni, dhi):
    # 5-minute timestamps covering one day: 12 per hour * 24 hours
    times = pd.date_range(date, freq='5min', periods=12*24, tz=site_location.tz)
    solar_position = site_location.get_solarposition(times=times)
    POA_irradiance = irradiance.get_total_irradiance(
        surface_tilt=tilt,
        surface_azimuth=surface_azimuth,
        ghi=ghi,
        dni=dni,
        dhi=dhi,
        solar_zenith=solar_position['apparent_zenith'],
        solar_azimuth=solar_position['azimuth'])
    # Return the inputs alongside the plane-of-array irradiance
    return pd.DataFrame({'GHI': ghi,
                         'DNI': dni,
                         'DHI': dhi,
                         'POA': POA_irradiance['poa_global']})
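For context, I call it along these lines (the coordinates and series names here are placeholders, not my real setup):
from pvlib import location
site = location.Location(52.1, 5.3, tz='Europe/Amsterdam')  # placeholder NL coordinates
poa = get_irradiance(site, '2022-06-12', tilt=12.5, surface_azimuth=180,
                     ghi=ghi_series, dni=dni_series, dhi=dhi_series)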
When I compare GHI and POA values for 12 June 2022 and 13 June 2022, I see the POA value for 12 June is significantly lower than the GHI. The location is in the Netherlands; I use a tilt of 12.5 degrees and an azimuth of 180 degrees. Here is the outcome (per hour, from 6:00 - 20:00):
12 June 2022
hour GHI DNI DHI POA
6 86.750000 312.750000 40.500000 40.277034
7 224.583333 543.000000 69.750000 71.130218
8 366.833333 598.833333 113.833333 178.974322
9 406.083333 182.000000 304.000000 348.272844
10 532.166667 266.750000 346.666667 445.422584
11 725.666667 640.416667 226.500000 509.360716
12 688.500000 329.416667 409.583333 561.630762
13 701.333333 299.750000 439.333333 570.415438
14 725.416667 391.666667 387.750000 532.529676
15 753.916667 629.166667 244.333333 407.665794
16 656.750000 599.750000 215.333333 293.832376
17 381.833333 36.416667 359.416667 356.317883
18 411.750000 569.166667 144.750000 144.254438
19 269.750000 495.916667 102.500000 102.084439
20 134.583333 426.416667 51.583333 51.370738
And
13 June 2022
hour GHI DNI DHI POA
6 5.666667 0.000000 5.666667 5.616296
7 113.500000 7.750000 111.416667 111.948831
8 259.500000 106.833333 208.416667 256.410392
9 509.166667 637.750000 150.583333 514.516389
10 599.333333 518.666667 240.583333 619.050821
11 745.250000 704.500000 195.583333 788.773772
12 757.250000 549.666667 292.000000 798.739403
13 742.000000 464.583333 335.000000 778.857394
14 818.250000 667.750000 243.000000 869.972769
15 800.750000 776.833333 166.916667 852.559043
16 699.000000 733.666667 167.166667 730.484502
17 582.666667 729.166667 131.916667 593.802853
18 449.166667 756.583333 83.500000 434.958210
19 290.083333 652.666667 68.666667 254.048655
20 139.833333 466.916667 48.333333 97.272684
What can explain the significantly lower POA compared to the GHI values on 12 June?
I have this outcome with other days too: some days have a POA much closer to the GHI than other days. Maybe this is "normal behaviour" and I am not reckoning with weather influences, which may be important...
I use the POA to do a PR (Performance Ratio) calculation, but I do not get trustworthy results.
Hope someone can shed some light on these values.
Kind regards,
Oscar
The Netherlands.
I'm really sorry: although the weather is unpredictable in the Netherlands, I made a very big booboo by using dd-mm-yyyy format where mm-dd-yyyy was expected. Something I overlooked for a long time... (I had never used mm-dd-yyyy, but that's a lame excuse...)
Really sorry; I hope you did not think about it for too long.
Thank you anyway for reacting!
I've got good values now!
Oscar (shame..)
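PS: for anyone hitting the same symptom, a minimal sketch of the mixup, assuming the timestamps arrive as dd-mm-yyyy strings and are parsed with pandas:
import pandas as pd
ts = '12-06-2022 13:00'                   # 12 June 2022 in dd-mm-yyyy
print(pd.to_datetime(ts))                 # 2022-12-06 13:00:00 -- parsed month-first, wrong
print(pd.to_datetime(ts, dayfirst=True))  # 2022-06-12 13:00:00 -- what was intended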
I am working with three-dimensional macroeconomic panel data in Stata. My data is compiled from 51 issues of the OECD Economic Outlook (EO), each containing data for up to 30 countries from 1960 up to 2010, where the first issue is from 1985 and the last is from 2010. The issues are released semiannually, and each issue has historic data as well as forecasts 2 periods ahead. So for each variable there are essentially three subscripts: country (i), the time the data concerns (t), and the time the data was released (r).
I want to identify a fiscal policy shock as a forecast error: the forecast of public spending minus the realized value from the EO issue one period later. So, for the forecasted value, t = r + 1 (the forecast appears in the issue one period before the period it concerns), while for the realized value, t = r. For public spending, g, the forecast error should look like:
g_i,t,r(t=r+1) - g_i,t,r(t=r)
(if that makes sense).
I have never worked with three-dimensional panel data, so I don't know how to code with it. Currently my data looks like this:
time_str value frequency location variable year eo year_half eo_year var_cat eo_half time_cal time_eo tt_cal tt_eo id_cal id_eo time_actual
1970_1 16214 S CAN cg 1970 38 1 1985 Govt final cons expen, val, GDP exp approach 2 1970 1985.5 21 1 1 504 1970h1
1970_2 17046 S CAN cg 1970 38 2 1985 Govt final cons expen, val, GDP exp approach 2 1970.5 1985.5 22 1 1 530 1970h2
1971_1 17768 S CAN cg 1971 38 1 1985 Govt final cons expen, val, GDP exp approach 2 1971 1985.5 23 1 1 556 1971h1
1971_2 18968 S CAN cg 1971 38 2 1985 Govt final cons expen, val, GDP exp approach 2 1971.5 1985.5 24 1 1 582 1971h2
1972_1 19442 S CAN cg 1972 38 1 1985 Govt final cons expen, val, GDP exp approach 2 1972 1985.5 25 1 1 608 1972h1
1972_2 21140 S CAN cg 1972 38 2 1985 Govt final cons expen, val, GDP exp approach 2 1972.5 1985.5 26 1 1 634 1972h2
1973_1 22274 S CAN cg 1973 38 1 1985 Govt final cons expen, val, GDP exp approach 2 1973 1985.5 27 1 1 660 1973h1
1973_2 23800 S CAN cg 1973 38 2 1985 Govt final cons expen, val, GDP exp approach 2 1973.5 1985.5 28 1 1 686 1973h2
Some explanation of the data:
tt_eo = id for the EO issue. In the example shown, all the data is from the first issue released in 1985
tt_cal = id for the actual period (the time the data concerns)
id_eo = id for each country-variable pair within each actual period (the release time varies)
id_cal = id for each country-variable pair within each EO issue (the period the data concerns varies)
time_eo = time of release
time_cal = actual time the data concerns
My economic variables are not stored as separate Stata variables but as values of the variable "variable". Therefore I cannot generate anything from them or refer to them directly, as Stata doesn't recognize them.
I have tried setting the data (see code below) but I still don't know how to work with the data.
*converting to time data and setting the time
gen time_actual = yh(year, year_half)
xtset id_cal time_actual, format(%th)
Does anyone have any suggestions on how to generate my forecast error variables (or generally how to work with this type of data)?
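Not a Stata answer, but to make the target concrete, here is the same pick-and-merge logic sketched in pandas with toy numbers (column names follow the listing above; this assumes the forecast for period t comes from the issue released one half-year earlier):
import pandas as pd
# Toy long-format data: one row per (country, period time_cal, issue time_eo) value of g
df = pd.DataFrame({'location': ['CAN'] * 4,
                   'time_cal': [1986.0, 1986.0, 1986.5, 1986.5],  # period the data concerns
                   'time_eo':  [1985.5, 1986.0, 1986.0, 1986.5],  # issue release time
                   'g':        [100.0, 103.0, 104.0, 108.0]})
fcst = df[df['time_eo'] == df['time_cal'] - 0.5].rename(columns={'g': 'g_fcst'})  # forecasts
real = df[df['time_eo'] == df['time_cal']].rename(columns={'g': 'g_real'})        # realizations
shock = fcst.merge(real, on=['location', 'time_cal'])
shock['fe'] = shock['g_fcst'] - shock['g_real']  # forecast error per country-period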
I am trying to reverse engineer the Fujitsu AC remote control protocol for a home automation project. I have gotten as far as identifying which bytes correspond to which control information, however there is a checksum at the end.
I believe the checksum is calculated using three other bytes (temperature, mode and fan speed).
I have used a spreadsheet to try to reverse engineer what operations were performed to get the checksum, and found that for a temperature of "00001010" and any mode/fan speed combination the following algorithm holds true:
Checksum = 392 - (Temperature + Mode + Fan Speed)
Example
392 - (10 + 64 + 128) = 190
392 - (10 + 192 + 128) = 62
392 - (10 + 32 + 128) = 222
However no other temperature (that I have tested) works this way. My current theory is that the temperature has some other operation performed on it first and that whatever this operation is results in the same value for a temperature of "00001010", but not other temperatures.
Raw data:
Temperature, Mode, Fan Speed, Checksum
00000110, 10000000, 10000000, 01110110
00001010, 10000000, 10000000, 01111110
00000010, 10000000, 10000000, 01110001
Full spreadsheet at: This link
I can't work out what operation(s) are being performed on the temperature, or in fact if I am even correct in my assumptions about what the algorithm is.
I'm wondering if there is anyone with more experience with this kind of problem that might be able to shed some light on this?
Extras:
The temperature value is the integer temperature, say 21 degrees (00010101):
1. Reversed to get 10101000
2. Only the first four bits taken - 1010
3. Then expanded to get a value of 00001010
So 00001010 in the raw data above is a temperature of 21 degrees
The original question has been edited, as I was originally approaching this incorrectly and assuming my hypothesis was correct.
I found the following solution after some more sifting through Google search results.
Thanks to George Dewar on GitHub.
1. Bit-reverse (flip) each of bytes 8 - 13 (I - N in the spreadsheet)
2. Sum those bytes
3. Calculate (208 - sum) % 256
4. Bit-reverse (flip) the result
E.g.
Data: 00000110, 10000000, 10000000, 00000000, 00000000, 00000000
1. Reverse:
01100000, 00000001, 00000001, 00000000, 00000000, 00000000
96, 1, 1, 0, 0, 0
2. Sum:
96 + 1 + 1 + 0 + 0 + 0 = 98
3. Calculate:
(208 - 98) % 256 = 110 (dec) or 01101110 (bin)
4. Reverse:
01110110
Answer provided by george-dewar on GitHub, so a massive thank you to him; I would never have worked that out. Mine only differs in that my remote has fewer options and therefore fewer bytes to reverse and sum; otherwise it works exactly as George has it in his example code.
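For reference, here is that algorithm sketched in Python (my reading of the steps above, not George's actual code; the test frame is the first raw-data row padded with zero bytes):
def reverse_bits(b):
    # reverse the bit order of a single byte, e.g. 00000110 -> 01100000
    return int('{:08b}'.format(b)[::-1], 2)
def fujitsu_checksum(data):
    # bit-reverse each byte, sum, subtract from 208 mod 256, bit-reverse the result
    total = sum(reverse_bits(b) for b in data)
    return reverse_bits((208 - total) % 256)
frame = [0b00000110, 0b10000000, 0b10000000, 0b00000000, 0b00000000, 0b00000000]
print('{:08b}'.format(fujitsu_checksum(frame)))  # 01110110, matching the worked example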
Let's say I have a Pandas DataFrame with two columns: 1) user_id, 2) steps (which contains the number of steps on the given date). Now I want to calculate the difference between the number of steps and the number of steps in the preceding measurement (measurements are guaranteed to be in order within my DataFrame).
So basically this comes down to appending an extra column to my DataFrame, where each row's value equals the value of the 'steps' column in that row minus the value of the 'steps' column in the row above (or 0 if this is the first row). To complicate things further, I want to calculate these differences per user_id, so I want to make sure that I do not subtract the steps values of two rows with different user_ids.
Does anyone have an idea how to get this done with Python 2.7 and pandas?
So an example to illustrate this.
Example input:
user_id steps
1015 48
1015 23
1015 79
1016 10
1016 20
Desired output:
user_id steps d_steps
1015 48 0
1015 23 -25
1015 79 56
2023 10 0
2023 20 10
Your output shows user ids that are not in your original data, but the following does what you want; you will have to replace/fill the NaN values with 0:
In [16]:
df['d_steps'] = df.groupby('user_id').transform('diff')
df.fillna(0, inplace=True)
df
Out[16]:
user_id steps d_steps
0 1015 48 0
1 1015 23 -25
2 1015 79 56
3 1016 10 0
4 1016 20 10
Here we generate the desired column by calling transform on the groupby object and passing a string that maps to the diff method, which subtracts the previous row's value. transform applies a function and returns a series with an index aligned to the df.
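An equivalent spelling that selects the column explicitly (a sketch):
df['d_steps'] = df.groupby('user_id')['steps'].diff().fillna(0)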
Let me preface this by saying I am new to pandas, so I'm sorry if this question is basic or has been answered before; I looked online and couldn't find what I needed.
I have a dataframe that contains a baseball team's schedule. Some of the games have been played already, and as a result the results of those games are entered in the dataframe. However, for games that are yet to happen, there is only the time they are to be played (e.g. 1:35 pm).
So, I would like to convert all of the values for the games yet to happen into NaNs.
Thank you
As requested, here is what the results dataframe for the Arizona Diamondbacks contains:
print MLB['ARI']
0 0
1 0
2 0
3 1
4 0
5 0
6 0
7 0
8 1
9 0
10 1
...
151 3:40 pm
152 8:40 pm
153 8:10 pm
154 4:10 pm
155 4:10 pm
156 8:10 pm
157 8:10 pm
158 1:10 pm
159 9:40 pm
160 8:10 pm
161 4:10 pm
Name: ARI, Length: 162, dtype: object
Couldn't figure out any direct solution, only an iterative one:
import numpy as np
for i in xrange(len(MLB)):
    if 'pm' in MLB['ARI'].iat[i] or 'am' in MLB['ARI'].iat[i]:
        MLB['ARI'].iat[i] = np.nan
This should work if your actual values (1s and 0s) are also strings. If they are numbers, try:
for i in xrange(len(MLB)):
    if type(MLB['ARI'].iat[i]) != type(1):
        MLB['ARI'].iat[i] = np.nan
The more idiomatic way to do this would be with the vectorised string methods.
http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods
mask = MLB['ARI'].str.contains('pm')  # create a boolean array marking the time strings
MLB['ARI'][mask] = np.nan  # the column name goes first
Create the boolean array and then use it to select the data you want.
Make sure that the column name goes before the masking array; otherwise you'll be acting on a copy of the data and your original dataframe won't get updated.
MLB['ARI'][mask]  # returns a view on the MLB dataframe, will be updated
MLB[mask]['ARI']  # returns a copy of MLB, won't be updated
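With a reasonably recent pandas you can also coerce the whole column to numeric, which turns the time strings into NaN in one step (a sketch, not from the answers above):
import pandas as pd
MLB['ARI'] = pd.to_numeric(MLB['ARI'], errors='coerce')  # '8:10 pm' -> NaN, 0/1 stay numeric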