Python dictionaries: I want to make a list of lists

I want to make a Connect 4 game, but it involves having lists of lists. Let's say there is a 'counter' in column 1; then I would have to add 1 to the list name (line1 -> line2). I've used dictionaries, but I just end up getting tuple errors and unhashable-type list errors.
Here's what I've got:
col1 = 0
col2 = 0
col3 = 0
col4 = 0
col5 = 0
col6 = 0
col7 = 0
line7 = [0,0,0,0,0,0,0]
line6 = [0,0,0,0,0,0,0]
line5 = [0,0,0,0,0,0,0]
line4 = [0,0,0,0,0,0,0]
line3 = [0,0,0,0,0,0,0]
line2 = [0,0,0,0,0,0,0]
line1 = [0,0,0,0,0,0,0]
alllist = {
line1,
line2,
line3,
line4,
line5,
line6,
line7,
}
a1 = int(input("What column do you want to place your counter on? "))
line1[(a1-1)] = (1)
if line1[0] == (1):
    col1 += 1
b1 = randrange(1,7)
b2 = random.choice(line1,line2,line3,line4,line5,line6,line7)
if b1 == col1:
    alllist[(0)+1] = (2)
One error message:
TypeError: unhashable type: 'list'
Any help would be appreciated!

If you want maintainable code, or code others can be bothered to try to understand, comment it properly: describe why and what for everything is there, wherever that isn't glaringly obvious.
For Python and hashability, have a look at the differences between tuples and lists: set members and dictionary keys must be hashable, lists are not, and putting line1..line7 into the set literal { ... } is exactly what raises TypeError: unhashable type: 'list'.
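For what it's worth, here is a minimal sketch of the board as a plain list of lists, with no dictionary or set needed (the names board and drop_counter are just illustrative, not from your code):
# Board as a list of lists: 7 rows x 7 columns, 0 = empty.
# board[row][col] replaces the separate line1..line7 names.
ROWS, COLS = 7, 7
board = [[0] * COLS for _ in range(ROWS)]
def drop_counter(board, col, player=1):
    # Place a counter in the lowest empty cell of the given column (0-based).
    for row in reversed(range(len(board))):
        if board[row][col] == 0:
            board[row][col] = player
            return True
    return False  # column is full
col = int(input("What column do you want to place your counter on? ")) - 1
drop_counter(board, col)
for row in board:
    print(row)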


Merging Tables in Apache Arrow

I have two arrow::Tables, where table 1 is:
colA colB
1 2
3 4
and table 2 is,
colC colD
i j
k l
where both table 1 and 2 have the same number of rows. I would like to join them side-by-side as
colA colB colC colD
1 2 i j
3 4 k l
I'm trying to use arrow::ConcatenateTables as follows, but I'm getting a bunch of nulls in my output (not shown)
t1 = ...  // std::shared_ptr<arrow::Table>
t2 = ...  // std::shared_ptr<arrow::Table>
arrow::ConcatenateTablesOptions options;
options.unify_schemas = true;
options.field_merge_options.promote_nullability = true;
auto merged = arrow::ConcatenateTables({t1, t2}, options);
How do I obtain the expected output?
arrow::ConcatenateTables only does row-wise concatenation. There is no built-in helper method for column-wise concatenation, but it is easy enough to create one yourself (apologies if this is not quite right, I'm not in front of a compiler at the moment):
std::shared_ptr<arrow::Table> CombineTables(const arrow::Table& left, const arrow::Table& right) {
  // Concatenate the column arrays from both tables side by side
  std::vector<std::shared_ptr<arrow::ChunkedArray>> columns = left.columns();
  const std::vector<std::shared_ptr<arrow::ChunkedArray>>& right_columns = right.columns();
  columns.insert(columns.end(), right_columns.begin(), right_columns.end());
  // Concatenate the field definitions from both schemas
  std::vector<std::shared_ptr<arrow::Field>> fields = left.schema()->fields();
  const std::vector<std::shared_ptr<arrow::Field>>& right_fields = right.schema()->fields();
  fields.insert(fields.end(), right_fields.begin(), right_fields.end());
  return arrow::Table::Make(arrow::schema(std::move(fields)), std::move(columns));
}

Merge Series of a pandas column (which is a Series itself) in groups

I have a pandas data frame in which one of the columns is a Series itself. Eg:
df.head()
Col1 Col2
1 ["name1","name2","name3"]
1 ["name3","name2","name4"]
2 ["name1","name2","name3"]
2 ["name1","name5","name6"]
I need to concatenate the Col2 in groups of Col1. I want something like
Col1 Col2
1 ["name1","name2","name3","name4"]
2 ["name1","name2","name3","name5","name6"]
I tried using a groupby with
.agg({"Col2": lambda x: pd.Series.append(x)})
but this throws an error saying two parameters are required. I also tried using sum in the agg function; that fails with a "does not reduce" error.
How do I do this?
You can use groupby with apply and a custom function: first flatten the nested lists with chain (the fastest solution), then remove duplicates with set, convert back to a list, and finally sort:
import pandas as pd
from itertools import chain

df = pd.DataFrame({'Col1':[1,1,2,2],
                   'Col2':[["name1","name2","name3"],
                           ["name3","name2","name4"],
                           ["name1","name2","name3"],
                           ["name1","name5","name6"]]})
print (df)
   Col1                   Col2
0     1  [name1, name2, name3]
1     1  [name3, name2, name4]
2     2  [name1, name2, name3]
3     2  [name1, name5, name6]

print (df.groupby('Col1')['Col2']
         .apply(lambda x: sorted(list(set(list(chain.from_iterable(x))))))
         .reset_index())
   Col1                                 Col2
0     1         [name1, name2, name3, name4]
1     2  [name1, name2, name3, name5, name6]
The solution can be simpler; only chain, set and sorted are necessary:
print (df.groupby('Col1')['Col2']
         .apply(lambda x: sorted(set(chain.from_iterable(x))))
         .reset_index())
   Col1                                 Col2
0     1         [name1, name2, name3, name4]
1     2  [name1, name2, name3, name5, name6]
Yeah, you wouldn't be able to use .agg({}) on list-valued data like this. Anyway, here's my stab at the problem, with the help of numpy (commented for clarity):
import numpy as np

# Set of group-by ("Col1") unique values
groupby = df["Col1"].unique()
# Create empty dict to store values on each iteration
d = {}
for i, val in enumerate(groupby):
    # Set "Col1" key to the unique value (e.g., 1)
    d.setdefault("Col1", []).append(val)
    # Create empty list to store values from "Col2"
    col2_unis = []
    # Create sub-DataFrame for each unique groupby value
    sdf = df.loc[df["Col1"] == val]
    # Loop through the 2D-array/Series "Col2" and append each
    # value to col2_unis (using a list comprehension)
    col2_unis.append([[j for j in array] for i, array in enumerate(sdf["Col2"].values)])
    # Set "Col2" key to the unique values of col2_unis
    d.setdefault("Col2", []).append(np.unique(col2_unis))

new_df = pd.DataFrame(d)
print(new_df)
A more condensed version would look like:
d = {}
for i, val in enumerate(df["Col1"].unique()):
    d.setdefault("Col1", []).append(val)
    sdf = df.loc[df["Col1"] == val]
    d.setdefault("Col2", []).append(np.unique([[j for j in array] for i, array in enumerate(sdf["Col2"].values)]))

new_df = pd.DataFrame(d)
print(new_df)
Learn more about Python's dict.setdefault() method by checking out this related SO question.
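As a quick illustration of what setdefault is doing in the loops above (a minimal sketch, not part of the original answer):
d = {}
# setdefault returns the existing value for the key, or inserts the
# default ([] here) and returns it if the key is missing.
d.setdefault("Col1", []).append(1)
d.setdefault("Col1", []).append(2)   # key exists now, so the same list is reused
print(d)   # {'Col1': [1, 2]}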

Adding a new column based on values

I have the following sample data:
data weight_club;
input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
Loss = StartWeight - EndWeight;
datalines;
1023 David Shaw red 189 165
1049 Amelia Serrano yellow 145 124
1219 Alan Nance purple 210 192
1246 Ravi Sinha yellow 194 177
1078 Ashley McKnight green 127 118
;
What I would like to do now is the following:
Create two lists with colours (e.g., list1 = "red" and "yellow", and list2 = "purple" and "green")
Classify the records according to whether or not they are in list1 and list2 and add a new column.
So the pseudo code is like this:
'Set new category called class
If item is in list1 then class = 1
Else if item is in list2 then class = 2
Else class = 3
Any thoughts on how I can do this most efficiently?
Your pseudocode is almost exactly it.
If item in ('red' 'yellow') then class = 1;
Else if item in ('purple' 'green') then class = 2;
Else class = 3;
This is really a lookup, so there are many other methods. One I usually recommend as well is PROC FORMAT, though in a simple case like this I'm not sure of any gains.
Proc format;
    Value $colour_cat
        'red', 'yellow' = 1
        'purple', 'green' = 2
        Other = 3;
Run;
And then, in a data step or SQL, either of the following can be used:
*actual conversion;
Category = put(colour, $colour_cat.);
* change display only;
Format colour $colour_cat.;

extra commas when using read_csv causing too many "s in data frame

I'm trying to read in a large file (~8 GB) using pandas read_csv. In one of the columns in the data, there is sometimes a list that includes commas but is enclosed by curly brackets, e.g.
"label1","label2","label3","label4","label5"
"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null}
Therefore, when these particular lines were read in I was getting the error "Error tokenizing data. C error: Expected 37 fields in line 35, saw 42". I found this solution, which said to add
sep=",(?![^{]*})" to the read_csv arguments; that split the data correctly. However, the data now includes the quotation marks around every entry (this didn't happen before I added the sep argument).
The data looks something like this now:
"label1" "label2" "label3" "label4" "label5"
"{A1}" "2" "" "False" "{ "apple" : false, "pear" : false, "banana" : null}"
meaning I can't use, for example, .describe(), etc. on the numerical columns because they're still strings.
Does anyone know of a way of reading it in without the quotation marks but still splitting the data where it is?
Very new to Python so apologies if there is an obvious solution.
serialdev found a solution for removing the "s, but the data columns are objects and not what I would expect/want, e.g. the integer values aren't seen as integers.
The data needs to be split at "," explicitly (including the "s); is there a way of stating that in the read_csv arguments?
Thanks!
To read in the data structure you specified, where the last element is of unknown length:
"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null}"
"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null, "orange": "true"}"
Change the separator to a regular expression using a negative lookahead assertion. This will split on a ',' only when it is not immediately followed by a space.
df = pd.read_csv('my_file.csv', sep=r'[,](?!\s)', engine='python', thousands='"')
print(df)
0 1 2 3 4
0 "{A1}" 2 NaN "False" "{ "apple" : false, "pear" : false, "banana" :...
1 "{A1}" 2 NaN "False" "{ "apple" : false, "pear" : false, "banana" :...
Specifying the thousands separator as the quote character is a bit of a hacky way to parse fields containing a quoted integer into the correct datatype. You can achieve the same result using converters, which can also remove the quotes from the strings should you need it to, and cast "True" or "False" to a boolean.
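For example, here is a rough sketch of the converters approach, assuming the first, second and fourth columns hold a quoted string, a quoted integer and a quoted boolean respectively (the helper names are made up for illustration):
import pandas as pd
def strip_quotes(value):
    # Remove the surrounding double quotes from a field.
    return value.strip('"')
def to_int(value):
    # Strip quotes, then cast to int (empty fields become None).
    value = value.strip('"')
    return int(value) if value else None
def to_bool(value):
    # Strip quotes, then map 'True'/'False' strings to booleans.
    return value.strip('"') == 'True'
# converters can be keyed by column position (or by column name).
df = pd.read_csv('my_file.csv', sep=r'[,](?!\s)', engine='python',
                 converters={0: strip_quotes, 1: to_int, 3: to_bool})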
If you need to remove the " characters from a column, use the vectorized function str.strip:
import pandas as pd

mydata = [{'"first_name"': '"Bill"', '"age"': '"7"'},
          {'"first_name"': '"Bob"', '"age"': '"8"'},
          {'"first_name"': '"Ben"', '"age"': '"9"'}]
df = pd.DataFrame(mydata)
print (df)
  "age" "first_name"
0   "7"       "Bill"
1   "8"        "Bob"
2   "9"        "Ben"

df['"first_name"'] = df['"first_name"'].str.strip('"')
print (df)
  "age" "first_name"
0   "7"         Bill
1   "8"          Bob
2   "9"          Ben
If you need to apply str.strip() to all columns, use:
df = pd.concat([df[col].str.strip('"') for col in df], axis=1)
df.columns = df.columns.str.strip('"')
print (df)
  age first_name
0   7       Bill
1   8        Bob
2   9        Ben
Timings:
mydata = [{'"first_name"': '"Bill"', '"age"': '"7"'},
          {'"first_name"': '"Bob"', '"age"': '"8"'},
          {'"first_name"': '"Ben"', '"age"': '"9"'}]
df = pd.DataFrame(mydata)
df = pd.concat([df]*3, axis=1)
df.columns = ['"first_name1"','"age1"','"first_name2"','"age2"','"first_name3"','"age3"']
#create sample [300000 rows x 6 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
df1, df2 = df.copy(), df.copy()

def a(df):
    df.columns = df.columns.str.strip('"')
    df['age1'] = df['age1'].str.strip('"')
    df['first_name1'] = df['first_name1'].str.strip('"')
    df['age2'] = df['age2'].str.strip('"')
    df['first_name2'] = df['first_name2'].str.strip('"')
    df['age3'] = df['age3'].str.strip('"')
    df['first_name3'] = df['first_name3'].str.strip('"')
    return df

def b(df):
    #apply str function to all columns in dataframe
    df = pd.concat([df[col].str.strip('"') for col in df], axis=1)
    df.columns = df.columns.str.strip('"')
    return df

def c(df):
    #apply str function to all columns in dataframe
    df = df.applymap(lambda x: x.lstrip('\"').rstrip('\"'))
    df.columns = df.columns.str.strip('"')
    return df

print (a(df))
print (b(df1))
print (c(df2))
In [135]: %timeit (a(df))
1 loop, best of 3: 635 ms per loop
In [136]: %timeit (b(df1))
1 loop, best of 3: 728 ms per loop
In [137]: %timeit (c(df2))
1 loop, best of 3: 1.21 s per loop
Would this work, since you have all the data that you need:
.map(lambda x: x.lstrip('\"').rstrip('\"'))
i.e. simply clean up all the occurrences of " afterwards.
EDIT with example:
mydata = [{'"first_name"' : '"bill', 'age': '"75"'},
          {'"first_name"' : '"bob', 'age': '"7"'},
          {'"first_name"' : '"ben', 'age': '"77"'}]

IN:  df = pd.DataFrame(mydata)
OUT:
  "first_name"   age
0        "bill  "75"
1         "bob   "7"
2         "ben  "77"

IN:  df['"first_name"'] = df['"first_name"'].map(lambda x: x.lstrip('\"').rstrip('\"'))
OUT:
0    bill
1     bob
2     ben
Name: "first_name", dtype: object
Use this sequence after selecting the column; it is not ideal, but it will get the job done:
.map(lambda x: x.lstrip('\"').rstrip('\"'))
You can change the dtypes afterwards using this pattern:
df['col'].apply(lambda x: pd.to_numeric(x, errors='ignore'))
or simply:
df[['col2','col3']] = df[['col2','col3']].apply(pd.to_numeric)
It depends on your file. Did you check whether your data has commas inside cells? If you have something like Banana : Fruit, Tropical, Eatable, etc. in the same cell, you're going to get this kind of bug. One basic solution is removing all the commas in the file. Or, if you can read it in, you can remove the special characters:
>>> df
                Banana
0  Hello, Salut, Salom
1              Bonjour
>>> df['Banana'] = df['Banana'].str.replace(',','')
>>> df
              Banana
0  Hello Salut Salom
1            Bonjour

In R, how can I insert a TRUE / FALSE column if strings in columns ARE / ARE NOT alphabetic?

Sample data:
df <- data.frame(noun1 = c("cat","dog"), noun2 = c("apple", "tree"))
noun1 noun2
1 cat apple
2 dog tree
How can I make a new column df$alpha that would read FALSE in row 1 and TRUE in row 2?
Thank you!
I think you can just apply is.unsorted() to each row, although you have to unlist it first (probably).
df <- data.frame(noun1 = c("cat","dog"), noun2 = c("apple", "tree"))
df$alpha <- apply(df,1,function(x) !is.unsorted(unlist(x)))
I found is.unsorted() via apropos("sort").