Store data grouped by datetime column using Pig - MapReduce

Say that I have a dataset like this:
1, 3, 2015-03-25 11-15-13
1, 4, 2015-03-26 11-16-14
1, 4, 2015-03-25 11-16-15
1, 5, 2015-03-27 11-17-11
...
I want to store the data by date, so that I end up with the following output folders:
2015-03-25/
2015-03-26/
2015-03-27/
...
How can I do that with Pig?
Thank you

You can use MultiStorage (from piggybank) for this.
Use a FOREACH ... GENERATE to project a column that contains just the date part you are interested in, then store split on that column, something like the following (the field names id, val, and ts are assumed here):
B = FOREACH A GENERATE id, val, SUBSTRING(ts, 0, 10) AS dt;
STORE B INTO '/my/home/output' USING org.apache.pig.piggybank.storage.MultiStorage('/my/home/output', '2');
The second argument ('2') is the 0-based index of the field to split on; each distinct value of that field becomes a subfolder (2015-03-25/, 2015-03-26/, ...) under the output path. Remember to REGISTER the piggybank jar first.


If cell contains a certain text, return a specific drop down list item (Google Sheets)

I've created a quote form, and in one cell (C6:H9) I enter the address, including the city name.
In another cell (C31:F31) I have a drop-down list of city names; when a city is chosen from the list, a cell beside it (G31) displays the tax percentage for that city.
I'm trying to figure out how to get the drop-down cell (C31:F31) to return the matching list item when a certain city name is typed in the address cell (C6:H9), but I'm having a hard time figuring out how to do so.
try:
=IFNA(VLOOKUP(VLOOKUP(REGEXEXTRACT(C6,
TEXTJOIN("|", 1, 'Tax Rates'!E3:E)), 'Tax Rates'!E3:F, 2, 0), 'Tax Rates'!A3:B, 2, 0))
update:
=IFNA(VLOOKUP(IFERROR(VLOOKUP(REGEXEXTRACT(C6,
TEXTJOIN("|", 1, 'Tax Rates'!E3:E)), 'Tax Rates'!E3:F, 2, ),
REGEXEXTRACT(C6, TEXTJOIN("|", 1, 'Tax Rates'!A3:A))), 'Tax Rates'!A3:B, 2, ))
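The core trick in these formulas is building a regex alternation ("City1|City2|...") from the city list with TEXTJOIN, then pulling out whichever city occurs in the address with REGEXEXTRACT. The same idea sketched in Python, with made-up city names standing in for 'Tax Rates'!E3:E and a made-up address standing in for C6:

```python
import re

# Hypothetical stand-ins for the 'Tax Rates'!E3:E city list and the C6 address.
cities = ['Austin', 'Dallas', 'Houston']
address = '123 Main St, Dallas, TX 75201'

# Like TEXTJOIN("|", 1, ...): join the city names into one alternation pattern.
# re.escape guards against regex metacharacters in city names.
pattern = '|'.join(map(re.escape, cities))

# Like REGEXEXTRACT(C6, pattern): find the first city mentioned in the address.
match = re.search(pattern, address)
print(match.group(0) if match else None)  # Dallas
```

Once the city name is extracted, the outer VLOOKUPs simply map it to its tax rate.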

PySpark Using collect_list to collect Arrays of Varying Length

I am attempting to use collect_list to collect arrays (and maintain order) from two different data frames.
Test_Data and Train_Data have the same format.
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy('Group').orderBy('date')
# Train_Data has 4 data points
# Test_Data has 7 data points
# desired target array: [1, 1, 2, 3]
# desired MarchMadInd array: [0, 0, 0, 1, 0, 0, 1]
sorted_list_diff_array_lens = train_data.withColumn(
        'target', F.collect_list('target').over(w)
    )\
    test_data.withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\
    .groupBy('Group')\
    .agg(F.max('target').alias('target'),
         F.max('MarchMadInd').alias('MarchMadInd')
    )
I realize the syntax is incorrect around "test_data.withColumn", but the intent is to take the MarchMadInd array from test_data and the target array from train_data. The desired output would look like the following:
{"target":[1, 1, 2, 3], "MarchMadInd":[0, 0, 0, 1, 0, 0, 1]}
Context: this is for a DeepAR time series model (using AWS) that requires dynamic features to include the prediction period, but the target should be historical data.
The solution involves using a join, as recommended by pault:
Create a dataframe with the dynamic features, of length equal to the training + prediction period.
Create a dataframe with the target values, of length equal to just the training period.
Use a LEFT JOIN (with the dynamic-feature data on the LEFT) to bring these dataframes together.
Now, using collect_list will create the desired result.
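Those steps can be sketched with the same shape in pandas, using made-up single-group data matching the arrays above; the real pipeline would do the equivalent join and collect_list over a window in PySpark:

```python
import pandas as pd

# Hypothetical frames: train carries targets for the 4 training points,
# test carries the dynamic feature through all 7 points.
train = pd.DataFrame({'Group': ['A'] * 4,
                      'date': pd.date_range('2019-03-01', periods=4),
                      'target': [1, 1, 2, 3]})
test = pd.DataFrame({'Group': ['A'] * 7,
                     'date': pd.date_range('2019-03-01', periods=7),
                     'MarchMadInd': [0, 0, 0, 1, 0, 0, 1]})

# LEFT JOIN with the dynamic-feature frame on the left keeps every
# prediction-period row; target is NaN where there is no history yet.
joined = test.merge(train, on=['Group', 'date'], how='left')

# Collect ordered lists per group, dropping the NaN padding on target.
result = (joined.sort_values('date')
                .groupby('Group')
                .agg(target=('target', lambda s: s.dropna().astype(int).tolist()),
                     MarchMadInd=('MarchMadInd', lambda s: s.tolist()))
                .reset_index())
print(result.to_dict('records')[0])
# {'Group': 'A', 'target': [1, 1, 2, 3], 'MarchMadInd': [0, 0, 0, 1, 0, 0, 1]}
```

The dropna() is what lets the two collected arrays end up with different lengths, which is exactly what DeepAR expects.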

PANDAS: When Writing To Excel Change to 1904 Date System

Hopefully this is a super easy question: when writing to a workbook, it would simplify my work if I could have it set to the 1904 date system upon creation. I am currently doing this with a macro afterwards, but doing it in that order adds four years to all my date fields.
Is it possible, while setting up the Excel writer, to have it auto-create the workbook set to 1904?
Thank you!
Andy
As Troy points out, it can be done from XlsxWriter via the constructor. It is also possible to pass this parameter through to the xlsxwriter engine in Pandas:
import pandas as pd
from datetime import date

df = pd.DataFrame({'Dates': [date(2018, 1, 1),
                             date(2018, 1, 2),
                             date(2018, 1, 3),
                             date(2018, 1, 4),
                             date(2018, 1, 5)]})

writer = pd.ExcelWriter("pandas_example.xlsx",
                        engine='xlsxwriter',
                        options={'date_1904': True})

df.to_excel(writer, sheet_name='Sheet1')
writer.save()  # close the writer so the workbook is actually written
See the Passing XlsxWriter constructor options to Pandas section of the XlsxWriter docs.
You can do it with xlsxwriter, but I don't think there's a direct way from pandas.
workbook = xlsxwriter.Workbook(filename, {'date_1904': True})
xlsxwriter.readthedocs.io/workbook.html

Parse textfile without fixed structure using python dictionary and Pandas

I have a .txt file without specific separators, and to parse it I need to count characters to know where each column starts and ends. To do so, I constructed a Python dictionary where the keys are the column names and the values are the number of characters each column takes:
headers = {'first_col': 3, 'second_col': 5, 'third_col': 2, ..., 'nth_col': n_chars}
Having that in mind, I know that the first three columns of the following line in the .txt file
ABC123-3YN0000000001203ABC123*TESTINGLINE
are:
first_col: ABC
second_col: 123-3
third_col: YN
I want to know if there is any pandas function that helps me parse this .txt file under this particular condition and (if possible) using my headers dictionary.
Using a plain dictionary is dangerous because, on Python versions before 3.7, its order is not guaranteed. Meaning, if third_col happened to come back first, you'd have thrown off your entire scheme. You can fix this by using lists. From there, you can use pd.read_fwf to read a fixed-width formatted text file.
Solution
names = ['first_col', 'second_col', 'third_col']
widths = [3, 5, 2]

pd.read_fwf(
    'myfile.txt',
    widths=widths,
    names=names
)

  first_col second_col third_col
0       ABC      123-3        YN
You can also use OrderedDict from the collections library to make sure you keep the order you want, by passing an iterator that produces tuples in the correct order:
from collections import OrderedDict

names = ['first_col', 'second_col', 'third_col']
widths = [3, 5, 2]
header = OrderedDict(zip(names, widths))

pd.read_fwf(
    'myfile.txt',
    widths=list(header.values()),
    names=list(header.keys())
)

  first_col second_col third_col
0       ABC      123-3        YN
Demonstration
from io import StringIO
from collections import OrderedDict
import pandas as pd

txt = """ABC123-3YN0000000001203ABC123*TESTINGLINE"""
names = ['first_col', 'second_col', 'third_col']
widths = [3, 5, 2]
header = OrderedDict(zip(names, widths))

pd.read_fwf(
    StringIO(txt),  # parse the string itself rather than a file on disk
    widths=list(header.values()),
    names=list(header.keys())
)

  first_col second_col third_col
0       ABC      123-3        YN
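As an aside: on Python 3.7+, insertion order of plain dicts is guaranteed by the language, so the original headers dictionary can be used directly in the same way. A minimal sketch, parsing the sample line from a string rather than a file:

```python
from io import StringIO
import pandas as pd

txt = "ABC123-3YN0000000001203ABC123*TESTINGLINE"

# On Python 3.7+ a plain dict preserves insertion order,
# so .keys() and .values() stay aligned with each other.
headers = {'first_col': 3, 'second_col': 5, 'third_col': 2}

df = pd.read_fwf(StringIO(txt),
                 widths=list(headers.values()),
                 names=list(headers.keys()))
print(df.iloc[0].tolist())  # ['ABC', '123-3', 'YN']
```

Any columns beyond the listed widths are simply ignored, so the dictionary only needs entries up to the last column you care about.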

Filling out select control with python mechanize

I'm trying to fill out the registration for a website with python mechanize. Everything is going well but I can't figure out how to do the select controls. For example, if I'm picking my birthday month, here's the form that I need to fill out:
<SelectControl(mm=[*, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])>
I've viewed all the answers on stackoverflow already and they all seem to be some variation like this:
br.find_control(name="mm").value = ["0"]
or
form["mm"] = ["1"]
The problem here is that it gives me an error: ItemNotFoundError: insufficient items with name '0'. I also tried:
item = br.find_control(name="mm", type="select").get("12")
item.selected = True
Never mind, I just needed to do br.form['mm'] = ["1"] (I selected "1" here, but could have picked any of the values they allowed).
I have used all of the following:
br['mm'] = ['9']
br['mm'] = ['9',]
br.form['mm'] = ['9']
br.form['mm'] = ['9',]
I seem to remember one case where the comma was mandatory.