I have a huge dictionary (70 million keys) that is structured as such:
mydict[key] = [val1, val2, val3]
I need to put val1 in a separate variable if I find a positive match for val2, and I am trying to do this without a for loop to save time. Basically, I want to do this:
for key, val in mydict.items():
    val1, val2, val3 = val
    if val2 == 'awesome match':
        new_var = val1
but I want to avoid the for loop.
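One possible sketch, assuming only the first match is needed: a generator expression passed to next() stops as soon as a match is found. It still iterates internally, so it is not loop-free, and it differs from the loop above, which keeps the last match. The sample data below is made up for illustration.

```python
# Hypothetical sample data with the same [val1, val2, val3] layout as the question.
mydict = {
    "a": [1, "no match", 3],
    "b": [2, "awesome match", 4],
    "c": [5, "something else", 6],
}

# next() pulls values from the generator until the first match, then stops.
# The second argument is the fallback when nothing matches.
new_var = next(
    (val[0] for val in mydict.values() if val[1] == "awesome match"),
    None,
)
print(new_var)
```

If every match is needed rather than the first one, the same condition can go into a list comprehension instead.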
I'm new to Scala and am trying to concatenate two lists of different sizes based on a condition.
Below are the lists:
val check1:String = "NULL||BLANK||LENGTH"
val check2:String = "LENGTH||DUPLICATE"
val check3:String = "NUMERIC"
val checkLists = List(check1,check2,check3)
checkLists: List[String] = List(NULL||BLANK||LENGTH, LENGTH||DUPLICATE, NUMERIC)
val condList = List(">=2","<7")
I'm trying to combine checkLists and condList into a new list based on a condition: whenever an element of checkLists contains the string "LENGTH", it should be concatenated with an entry from condList, like below:
List(NULL||BLANK||LENGTH~>=2, LENGTH~<7||DUPLICATE, NUMERIC)
I am able to use zip, foreach and case to concatenate two lists of equal size, but here I'm having trouble with lists of different sizes.
Using zipAll will give the answer you are looking for:
checkLists.zipAll(condList, "", "").map {
  case (check, cond) => check.replaceAll("LENGTH", "LENGTH~" + cond)
}
List(NULL||BLANK||LENGTH~>=2, LENGTH~<7||DUPLICATE, NUMERIC)
The missing element of condList is given as "", but a different default condition could be used if required.
Note that if the second LENGTH string is in the third element of checkLists rather than the second element, it will not get any condition. This may or may not be what is required.
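To make the difference between pairing by position and pairing by occurrence concrete, here is a rough Python sketch of both. The function names are made up for illustration, and itertools.zip_longest stands in for Scala's zipAll; this is not a translation of the asker's real code.

```python
from itertools import zip_longest

check_lists = ["NULL||BLANK||LENGTH", "LENGTH||DUPLICATE", "NUMERIC"]
cond_list = [">=2", "<7"]

def attach_by_position(checks, conds):
    """Pair elements by index, padding the shorter list with "" (like zipAll)."""
    return [
        check.replace("LENGTH", "LENGTH~" + cond)
        for check, cond in zip_longest(checks, conds, fillvalue="")
    ]

def attach_in_order(checks, conds):
    """Hand out conditions only to entries that actually contain LENGTH."""
    remaining = iter(conds)
    result = []
    for check in checks:
        if "LENGTH" in check:
            cond = next(remaining, None)  # None once the conditions run out
            if cond is not None:
                check = check.replace("LENGTH", "LENGTH~" + cond)
        result.append(check)
    return result

print(attach_by_position(check_lists, cond_list))
print(attach_in_order(["NUMERIC", "LENGTH", "LENGTH||DUPLICATE"], cond_list))
```

The second function does not depend on where the LENGTH entries sit in the list, which may be closer to what is wanted when the two lists are not aligned by index.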
I am quite new to the Pig environment. I have tried to implement my Pig script in two ways.
I.
data = LOAD 'sample2.txt' USING PigStorage(',') as(campaign_id:chararray,date:chararray,time:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int,keyword:chararray);
distinct_data = DISTINCT data;
val = foreach distinct_data generate campaign_id,date,time,UPPER(keyword),display_site,placement,was_clicked,cpc;
val1 = foreach val generate campaign_id,date,time,TRIM(keyword),display_site,placement,was_clicked,cpc;
val2 = foreach val1 generate campaign_id,REPLACE(date, '-', '/'),time,keyword,display_site,placement,was_clicked,cpc;
dump val2;
I get this error:
2016-09-29 02:45:40,826 INFO org.apache.pig.Main: Apache Pig version 0.10.0-cdh4.2.1 (rexported) compiled Apr 22 2013, 12:04:54
2016-09-29 02:45:40,827 INFO org.apache.pig.Main: Logging error messages to: /home/training/training_materials/analyst/exercises/pig_etl/pig_1475131540824.log
2016-09-29 02:45:42,371 ERROR org.apache.pig.tools.grunt.Grunt: ERROR 1025: Invalid field projection. Projected field [keyword] does not exist in schema: campaign_id:chararray,date:chararray,time:chararray,org.apache.pig.builtin.upper_keyword_12:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int.
Details at logfile: /home/hduser/pig_etl/pig_1475131540824.log
But when I combine UPPER, TRIM and REPLACE in one statement, it works:
II.
data = LOAD 'sample2.txt' USING PigStorage(',') as(campaign_id:chararray,date:chararray,time:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int,keyword:chararray);
distinct_data = DISTINCT data;
val = foreach distinct_data generate campaign_id,REPLACE(date, '-', '/'),time,TRIM(UPPER(keyword)),display_site,placement,was_clicked,cpc;
dump val;
So, I would like someone to explain why method I didn't work and what the error message means.
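When you apply TRIM in val1, there is no field called "keyword" in val. As the schema in the error message shows, the result of UPPER(keyword) was given a generated name because no alias was supplied.
Whenever you apply a function, give the result an alias with "as" so that you avoid this error.
It is also a good idea to run describe on a relation before building a new one from it, so that the schema is clear to you.
The solution is: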
data = LOAD 'sample2.txt' USING PigStorage(',') as(campaign_id:chararray,date:chararray,time:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int,keyword:chararray);
distinct_data = DISTINCT data;
val = foreach distinct_data generate campaign_id,date,time,UPPER(keyword) as keyword,display_site,placement,was_clicked,cpc;
val1 = foreach val generate campaign_id,date,time,TRIM(keyword) as keyword,display_site,placement,was_clicked,cpc;
val2 = foreach val1 generate campaign_id,REPLACE(date, '-', '/') as date,time,keyword,display_site,placement,was_clicked,cpc;
dump val2;
Suppose I have a list of dictionaries, all of which have the same keys. An instance of such a dictionary in the list might look like:
dict = {"Height": 6.0, "Weight": 201.5, "Name": "John", "Status": "Married"}
Given only a few (key, value) pairs, I want to extract all dictionaries satisfying those pairs. For example, if I have
attributes = {"Height": 5.5, "Name": "John"}
I want to extract all dictionaries whose height value is greater than or equal to 5.5 AND whose name value is John.
I'm able to write the code that can satisfy one OR the other, but dealing with mixed types (float and string) is throwing me off, so my AND operator is being confused I guess. The problem part of my code, for example, is:
for option in attributes.keys():
    if dict[option] == attributes[option] and dict[option] >= attributes[option]:
        print dict
If you have several different conditions, you have to check all of them. Instead of chaining them with and, you can use the built-in all function inside a filtering function, and then use that function in a list comprehension or with filter to get the expected dictionaries.
from operator import itemgetter
# key1, key2, key3 are placeholders for the keys in attributes and must be defined first
option1, option2, option3 = itemgetter(key1, key2, key3)(attributes)
def filter_func(temp_dict):
    val1, val2, val3 = itemgetter(key1, key2, key3)(temp_dict)
    # all() takes a single iterable; == is an exact match, >= means the
    # dictionary's value must be at least the requested value
    return all([option1 == val1, val2 >= option2, val3 >= option3])
filtered_result = filter(filter_func, list_of_dictionaries)
Also note that if some of the dictionaries in your list might not have all the specified keys, itemgetter will raise a KeyError. To get rid of that, you can use the dict.get method and pass it a default value (based on your needs).
For example, for val3 you can use temp_dict.get(key3, 0).
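As a self-contained illustration of the same idea, here is a minimal sketch in Python 3. It assumes that numeric attributes mean "at least this value" and that everything else must match exactly, which is how the question describes Height and Name. The helper names and the second and third sample records are made up.

```python
def satisfies(actual, wanted):
    """Numbers must be at least the wanted value; anything else must match exactly."""
    numeric = (int, float)
    if isinstance(wanted, numeric) and isinstance(actual, numeric):
        return actual >= wanted
    return actual == wanted

def filter_records(records, attributes):
    """Keep the records that satisfy every (key, value) pair in attributes."""
    return [
        record for record in records
        # "key in record" guards against a KeyError for incomplete records
        if all(key in record and satisfies(record[key], wanted)
               for key, wanted in attributes.items())
    ]

people = [
    {"Height": 6.0, "Weight": 201.5, "Name": "John", "Status": "Married"},
    {"Height": 5.2, "Weight": 150.0, "Name": "John", "Status": "Single"},
    {"Height": 5.9, "Weight": 170.0, "Name": "Mary", "Status": "Single"},
]
attributes = {"Height": 5.5, "Name": "John"}

matches = filter_records(people, attributes)
print(matches)
```

Choosing the comparison by the type of the requested value avoids writing one condition per key, at the cost of treating every numeric attribute as a lower bound.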
I have a program that stores data in a SQLite db. Among other tables in the db, I have one created as follows:
conn.execute("CREATE TABLE {tn} ({cn} {ct})".format(tn=test, cn="STEP_NAME", ct="TEXT"))
The table it creates has several columns. One is added as follows:
conn.execute("ALTER TABLE {tn} ADD COLUMN '{cn}' {ct} ".format(tn=test, cn=value, ct="TEXT"))
I'm trying to save data to it, but it's behaving in a way I can't explain. When I save 270113185308874890 to it, it comes back as 270113185308874890 when read. However, when I save 89014103258771944209 to it, it is stored as 8.90141032588e+19.
How can I prevent this? I've tried different column types with no luck and really don't understand why it's converting it.
EDIT:
Code that I'm using to store it
def store_result(conn, table_name, row_name, data):
    for k, v in data.iteritems():
        if isinstance(v, str):
            data[k] = v.replace('"', "'").rstrip(' \t\r\n\0')
    keys = data.keys()
    vals = data.values()
    # add test name column for everything but info call
    if table_name != "info":
        keys.insert(0, "STEP_NAME")
        vals.insert(0, str(row_name))
    # Make pretty for sqlite3 and its crazy param rules.
    sql_keys = ','.join(str(v) for v in keys)
    sql_vals = ','.join(str(v) for v in [x if str.isdigit(str(x)) else '"{}"'.format(x) for x in vals])
    # try to write or tell me why not.
    try:
        conn.execute("""INSERT into {table}({sql_keys}) values ({vals})""".format(table=table_name,
                                                                                  sql_keys=sql_keys,
                                                                                  vals=sql_vals))
        conn.commit()
    except Exception as e:
        logging.warn("DB ERROR:{}_{}_{}".format(e, table_name, row_name))
When you print the values after they are returned from the table, the type of the variable that holds the values affects both how they're printed and their precision. As an example:
int1 = 270113185308874890
float1 = 270113185308874890.0
int2 = 89014103258771944209
float2 = 89014103258771944209.0
print 'int1 : ' + str(int1)
print 'float1: ' + str(float1)
print ''
print 'int2 : ' + str(int2)
print 'float2: ' + str(float2)
Will print:
int1 : 270113185308874890
float1: 2.70113185309e+17
int2 : 89014103258771944209
float2: 8.90141032588e+19
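The loss is not only cosmetic. A small Python 3 sketch (the helper name is made up) shows that neither number survives a round trip through float, while a string keeps every digit:

```python
small = 270113185308874890
large = 89014103258771944209

def survives_float(n):
    """Return True if converting n to float and back yields the same integer."""
    return int(float(n)) == n

for n in (small, large):
    status = "exact" if survives_float(n) else "digits lost"
    print(n, "->", int(float(n)), status)

# A string round trip keeps every digit.
print(int(str(large)) == large)
```

Both values are above 2**53, the point past which a 64-bit float can no longer represent every integer.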
It seems likely that in the SQLite table the type is TEXT, as shown in the example from the SQLite website (https://www.sqlite.org/datatype3.html) below. You should use the typeof() function to ensure that your data is being stored as TEXT.
Finally, you should consider using the INTEGER type rather than TEXT in your SQLite table if all of your numbers will be integers. Also, if you are using TEXT to try to preserve precision, make sure you are not limited by the calling code. That is, unless you are dealing with the Python Decimal type, the SQLite REAL type will match the precision of the Python float type.
2.3 Column Affinity Behavior Example
The following SQL demonstrates how SQLite uses column affinity to do
type conversions when values are inserted into a table.
CREATE TABLE t1(
    t TEXT, -- text affinity by rule 2
    nu NUMERIC, -- numeric affinity by rule 5
    i INTEGER, -- integer affinity by rule 1
    r REAL, -- real affinity by rule 4
    no BLOB -- no affinity by rule 3
);
-- Values stored as TEXT, INTEGER, INTEGER, REAL, TEXT.
INSERT INTO t1 VALUES('500.0', '500.0', '500.0', '500.0', '500.0');
SELECT typeof(t), typeof(nu), typeof(i), typeof(r), typeof(no) FROM t1;
text|integer|integer|real|text
-- Values stored as TEXT, INTEGER, INTEGER, REAL, REAL.
DELETE FROM t1;
INSERT INTO t1 VALUES(500.0, 500.0, 500.0, 500.0, 500.0);
SELECT typeof(t), typeof(nu), typeof(i), typeof(r), typeof(no) FROM t1;
text|integer|integer|real|real
-- Values stored as TEXT, INTEGER, INTEGER, REAL, INTEGER.
DELETE FROM t1;
INSERT INTO t1 VALUES(500, 500, 500, 500, 500);
SELECT typeof(t), typeof(nu), typeof(i), typeof(r), typeof(no) FROM t1;
text|integer|integer|real|integer
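The typeof() check can also be run from Python. The sketch below uses a made-up table and column, not the asker's schema. It compares pasting the digits into the SQL text, as store_result does for anything that passes isdigit, with binding the value as a parameter. SQLite treats an integer literal that does not fit in a 64-bit signed integer as floating point, so the inlined value is changed before the TEXT column ever sees it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the column has TEXT affinity like the one in the question.
conn.execute("CREATE TABLE demo (step_name TEXT, reading TEXT)")

big = "89014103258771944209"  # too large for a 64-bit signed integer

# Digits pasted into the SQL text are parsed by SQLite as a numeric literal.
conn.execute("INSERT INTO demo VALUES ('inlined', {})".format(big))

# A bound parameter is handed over as a Python str and stored as TEXT unchanged.
conn.execute("INSERT INTO demo VALUES (?, ?)", ("bound", big))

rows = conn.execute(
    "SELECT step_name, reading, typeof(reading) FROM demo ORDER BY step_name"
).fetchall()
for row in rows:
    print(row)

stored = {name: (reading, kind) for name, reading, kind in rows}
conn.close()
```

Both rows report text from typeof(), but only the bound row still holds the original digits, which suggests using "?" placeholders for the values instead of formatting them into the statement.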
I am getting a queryset object which I want to convert into a plain list.
[<yyyy : val1 val2 val3 val4 xxxx#xxx.com True True>, <yyyy : 2 None False True>]
I tried by
res.values_list()
But I am getting
[(val1,val2,val3,val4),(val11,val22,val33,val44)]
tuple(list(obj) for obj in res.values_list())
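A short sketch of the conversion, using a plain list of tuples as a stand-in for what res.values_list() yields so that it runs without Django. The rows variable is hypothetical; with a real queryset, res.values_list() would take its place.

```python
# Stand-in for the tuples produced by res.values_list().
rows = [("val1", "val2", "val3", "val4"), ("val11", "val22", "val33", "val44")]

# One inner list per row.
list_of_lists = [list(row) for row in rows]

# A single flat list holding every value.
flat = [value for row in rows for value in row]

print(list_of_lists)
print(flat)
```

When only one field is needed, Django's values_list accepts a single field name together with flat=True and yields the bare values instead of one-element tuples.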