Python Script to query from DynamoDb not giving all items - amazon-web-services

I have written the following Python code to fetch data from a table, but it is not returning all the items. When I check the table on the AWS console page for DynamoDB, I can see many more entries than what I get from the script.
from __future__ import print_function  # Python 2/3 compatibility
import boto3
import json
import decimal
from datetime import datetime
from boto3.dynamodb.conditions import Key, Attr
import sys

# Helper class to convert a DynamoDB item to JSON.
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            if o % 1 > 0:
                return float(o)
            else:
                return int(o)
        return super(DecimalEncoder, self).default(o)

dynamodb = boto3.resource('dynamodb', aws_access_key_id='',
                          aws_secret_access_key='',
                          region_name='eu-west-1',
                          endpoint_url="http://dynamodb.eu-west-1.amazonaws.com")

mplaceId = int(sys.argv[1])
table = dynamodb.Table('XYZ')
response = table.query(
    KeyConditionExpression=Key('mplaceId').eq(mplaceId)
)
print('Number of entries found ', len(response['Items']))
I ran the same query (by mplaceId) from the AWS console as well, and it returns more items.
Any reason why this is happening?

dynamodb.Table.query() returns at most 1 MB of data. From the boto3 documentation:
A single Query operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression. If LastEvaluatedKey is present in the response, you will need to paginate the result set. For more information, see Paginating the Results in the Amazon DynamoDB Developer Guide.
That is not a boto3 limitation, but a limitation of the underlying Query API.
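A minimal sketch of manual pagination with the Table resource, reusing the table and key from the question: keep calling query and feed LastEvaluatedKey back in as ExclusiveStartKey until it is no longer returned.

import sys

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb', region_name='eu-west-1')
table = dynamodb.Table('XYZ')
mplaceId = int(sys.argv[1])

items = []
kwargs = {'KeyConditionExpression': Key('mplaceId').eq(mplaceId)}
while True:
    response = table.query(**kwargs)
    items.extend(response['Items'])
    # LastEvaluatedKey is only present while there are more pages to fetch
    if 'LastEvaluatedKey' not in response:
        break
    kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

print('Number of entries found', len(items))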
Alternatively, instead of implementing pagination yourself, you can use boto3's built-in pagination. Here is an example that uses the paginator boto3 provides for querying DynamoDB tables:
import sys

import boto3

mplaceId = int(sys.argv[1])

dynamodb_client = boto3.client('dynamodb')
paginator = dynamodb_client.get_paginator('query')
page_iterator = paginator.paginate(
    TableName='XYZ',
    KeyConditionExpression='mplaceId = :mplaceId',
    # The low-level client expects typed attribute values; mplaceId is a number ('N')
    ExpressionAttributeValues={':mplaceId': {'N': str(mplaceId)}}
)
for page in page_iterator:
    print(page['Items'])

Related

BigQuery table to df (dataframe) in a Cloud Function

I have a BigQuery table that I would like to extract into a pandas DataFrame inside a Cloud Function, then make some changes to the header row and save the result to Cloud Storage. Unfortunately my function is not working; can anyone see what the issue could be? Do I need to use a BigQuery extract job, or is my approach also valid?
import base64
import pandas as pd
from google.cloud import bigquery

def extract_partial_return(event, context):
    client = bigquery.Client()
    bucket_name = "abc_test"
    project = "bq_project"
    dataset_id = "bq_dataset"
    table_id = "Partial_Return_Table"
    sql = """
    SELECT * FROM `bq_project.bq_dataset.Partial_Return_Table`
    """
    # Running the query and putting the results directly into a df
    df = client.query(sql).to_dataframe()
    df.columns = ["ga:t_Id", "ga:product", "ga:quantity"]
    destination_uri = (
        "gs://abc_test/Exports/Partial_Return_Table.csv"
    )
    df.to_csv(destination_uri)
My requirements.txt looks like this:
# Function dependencies, for example:
# package>=version
google-cloud-bigquery
pandas
pyarrow
The pyarrow library is the key here:
import base64
import pandas as pd
from google.cloud import bigquery

def extract_partial_return(event, context):
    client = bigquery.Client()
    sql = """
    SELECT * FROM `bq_project.bq_dataset.Partial_Return_Table`
    """
    # Running the query and putting the results directly into a df
    df = client.query(sql).to_dataframe()
    df.columns = ["ga:t_Id", "ga:product", "ga:quantity"]
    destination_uri = ("gs://abc_test/Exports/Partial_Return_Table.csv")
    df.to_csv(destination_uri)
requirements.txt:
pandas
fsspec
gcsfs
google-cloud-bigquery
google-cloud-storage
pyarrow
First of all, the idea of doing this is okay; however, you may want to use other products such as Dataflow or Dataproc, which are designed for these purposes.
On the other hand, to complete the idea you have now, you should take care of how you construct the SQL command, because you are not using the variables you created for the project, dataset, and table. The same issue happens with the bucket. Moreover, I think you are lacking a couple of dependencies (fsspec and gcsfs).
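For instance, a minimal sketch of that suggestion, reusing the variable names from the question (the bucket path and column names are just the question's placeholders):

import pandas as pd
from google.cloud import bigquery

def extract_partial_return(event, context):
    client = bigquery.Client()
    bucket_name = "abc_test"
    project = "bq_project"
    dataset_id = "bq_dataset"
    table_id = "Partial_Return_Table"

    # Build the query and the destination URI from the variables instead of hard-coding them
    sql = "SELECT * FROM `{}.{}.{}`".format(project, dataset_id, table_id)
    destination_uri = "gs://{}/Exports/{}.csv".format(bucket_name, table_id)

    # Running the query and putting the results directly into a df
    df = client.query(sql).to_dataframe()
    df.columns = ["ga:t_Id", "ga:product", "ga:quantity"]

    # Writing directly to gs:// requires the fsspec and gcsfs packages
    df.to_csv(destination_uri)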
Manuel

How can I save the output of an API call to DynamoDB?

I am calling a REST API from a Lambda function in AWS and I receive the response in the terminal. How can I store these API responses in DynamoDB? I am sorry if this may seem naive to some; I am a beginner and self-learner.
Here is a full working example for a Python-based Lambda function:
# Import AWS (Python version) SDK
import boto3

def lambda_handler(event, context):
    # Construct the client to the DynamoDB service
    client = boto3.client('dynamodb')
    db = boto3.resource('dynamodb')
    table_name = 'my-awesome-table'

    # Table instance
    table = db.Table(table_name)

    # Example data to query an entry from the table
    primary_column_name = 'email'
    primary_key = 'john@gmail.com'

    # We get an object for the retrieved table entry
    table_entry = get_table_item(table, primary_column_name, primary_key)

    # Example data to put into the table
    new_entry = {
        'email': 'jack@outlook.com',
        'name': 'Jack Ma',
        'age': 56
    }
    put_table_item(table, new_entry)

# Function to retrieve an entry from the DynamoDB table
def get_table_item(table, primary_column_name, primary_key):
    response = table.get_item(
        Key={
            primary_column_name: primary_key
        }
    )
    return response['Item']

# Function to put an entry into the DynamoDB table
def put_table_item(table, item):
    response = table.put_item(
        Item=item
    )
    # If there were errors, throw an exception
    if response['ResponseMetadata']['HTTPStatusCode'] != 200:
        raise Exception('Failed to update table!')
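To tie this back to the question, here is a rough sketch of calling a REST API and writing the response to the same table. It assumes the requests library is bundled with the deployment package, and the endpoint URL and the 'email' field in the payload are hypothetical placeholders.

import json

import boto3
import requests  # must be bundled with the deployment package or provided via a layer

def lambda_handler(event, context):
    db = boto3.resource('dynamodb')
    table = db.Table('my-awesome-table')

    # Call the REST API (hypothetical endpoint) and parse the JSON body
    api_response = requests.get('https://api.example.com/data')
    api_response.raise_for_status()
    payload = api_response.json()

    # The item must include the table's partition key ('email' in the example above);
    # the raw response is kept as a JSON string to avoid type issues with floats.
    table.put_item(Item={
        'email': payload['email'],
        'payload': json.dumps(payload)
    })
    return {'statusCode': 200}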

boto2 dynamodb put_item always creates a new item

I am using boto 2.45 and Python 2.7. Here is the code:
import time
import boto
from boto.dynamodb2.layer1 import DynamoDBConnection
from boto.dynamodb2.table import Table
from boto.dynamodb2.fields import HashKey, RangeKey, KeysOnlyIndex, GlobalAllIndex

access_key = 'XXX'
secret_key = 'YYY'

conn = boto.dynamodb2.connect_to_region(
    'us-east-2',
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key
)

table = Table('Table1', connection=conn)
table.put_item(data={
    'name': 'aaa.rar',
    'time': int(time.time()),
    'size': 200000000000,
    'title': 'from boto'
})
Every time I run this code, it creates a new item in the table with the name 'aaa.rar'. The table's primary key is name. This is not what I want: I want it to raise an exception if an item with the same name already exists in the table.
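One way to get that behavior is a conditional write. Below is a minimal sketch that uses boto3 rather than boto2 (the table name and attributes are taken from the question, and it assumes name really is the only key attribute):

import time

import boto3
from boto3.dynamodb.conditions import Attr
from botocore.exceptions import ClientError

dynamodb = boto3.resource('dynamodb', region_name='us-east-2')
table = dynamodb.Table('Table1')

try:
    table.put_item(
        Item={
            'name': 'aaa.rar',
            'time': int(time.time()),
            'size': 200000000000,
            'title': 'from boto'
        },
        # Only write if no item with this key already exists
        ConditionExpression=Attr('name').not_exists()
    )
except ClientError as e:
    if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
        raise Exception("An item named 'aaa.rar' already exists")
    raise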

Does boto3 have a credential cache comparable to awscli?

With awscli there's a credential cache in ~/.aws/cli/cache which allows me to cache credentials for a while. This is very helpful when using MFA. Does boto3 have a similar capability or do I have to explicitly cache my credentials returned from session = boto3.session.Session(profile_name='CTO:Admin')?
It is already there.
http://boto3.readthedocs.org/en/latest/guide/configuration.html#assume-role-provider
When you specify a profile that has IAM role configuration, boto3 will make an AssumeRole call to retrieve temporary credentials. Subsequent boto3 API calls will use the cached temporary credentials until they expire, in which case boto3 will automatically refresh credentials. boto3 does not write these temporary credentials to disk. This means that temporary credentials from the AssumeRole calls are only cached in memory within a single Session. All clients created from that session will share the same temporary credentials.
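A small sketch of what that looks like in practice, using the profile name from the question (it is assumed to be configured in ~/.aws/config with role_arn / source_profile, and optionally mfa_serial):

import boto3

# One AssumeRole call is made for this session; its temporary credentials are
# cached in memory and refreshed automatically, but only within this process.
session = boto3.session.Session(profile_name='CTO:Admin')

# Both clients share the same cached temporary credentials
s3 = session.client('s3')
ec2 = session.client('ec2')

print(s3.list_buckets()['Buckets'])
print(ec2.describe_regions()['Regions'][:3])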
I created a Python library that provides this for you - see https://github.com/mixja/boto3-session-cache
Example:
import boto3_session_cache
# This returns a regular boto3 client object with the underlying session configured with local credential cache
client = boto3_session_cache.client('ecs')
ecs_clusters = client.list_clusters()
To summarise the above points, here is a working example:
from os import path
import os
import sys
import json
import datetime
from distutils.spawn import find_executable
from botocore.exceptions import ProfileNotFound
import boto3
import botocore

def json_encoder(obj):
    """JSON encoder that formats datetimes as ISO8601 format."""
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    else:
        return obj

class JSONFileCache(object):
    """JSON file cache.

    This provides a dict like interface that stores JSON serializable
    objects.

    The objects are serialized to JSON and stored in a file. These
    values can be retrieved at a later time.
    """

    CACHE_DIR = path.expanduser(path.join('~', '.aws', 'ansible-ec2', 'cache'))

    def __init__(self, working_dir=CACHE_DIR):
        self._working_dir = working_dir

    def __contains__(self, cache_key):
        actual_key = self._convert_cache_key(cache_key)
        return path.isfile(actual_key)

    def __getitem__(self, cache_key):
        """Retrieve value from a cache key."""
        actual_key = self._convert_cache_key(cache_key)
        try:
            with open(actual_key) as f:
                return json.load(f)
        except (OSError, ValueError, IOError):
            raise KeyError(cache_key)

    def __setitem__(self, cache_key, value):
        full_key = self._convert_cache_key(cache_key)
        try:
            file_content = json.dumps(value, default=json_encoder)
        except (TypeError, ValueError):
            raise ValueError("Value cannot be cached, must be "
                             "JSON serializable: %s" % value)
        if not path.isdir(self._working_dir):
            os.makedirs(self._working_dir)
        with os.fdopen(os.open(full_key,
                               os.O_WRONLY | os.O_CREAT, 0o600), 'w') as f:
            f.truncate()
            f.write(file_content)

    def _convert_cache_key(self, cache_key):
        full_path = path.join(self._working_dir, cache_key + '.json')
        return full_path

session = boto3.session.Session()
try:
    cred_chain = session._session.get_component('credential_provider')
except ProfileNotFound:
    print("Invalid Profile")
    sys.exit(1)

provider = cred_chain.get_provider('assume-role')
provider.cache = JSONFileCache()

# Do something with the session...
ec2 = session.resource('ec2')
Originally, the credential caching and automatic renewal of temporary credentials was part of the AWS CLI, but this commit (and some subsequent ones) moved that functionality to botocore, which means it is now available in boto3 as well.
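If your installed botocore already ships such a cache class, you can reuse it instead of maintaining your own copy. A hedged sketch (the import location of JSONFileCache has moved between botocore versions, so both paths are tried; by default it writes JSON files to a cache directory under ~/.aws):

import sys

import boto3
from botocore.exceptions import ProfileNotFound

try:
    from botocore.utils import JSONFileCache        # newer botocore versions
except ImportError:
    from botocore.credentials import JSONFileCache  # older botocore versions

session = boto3.session.Session()
try:
    cred_chain = session._session.get_component('credential_provider')
except ProfileNotFound:
    print("Invalid Profile")
    sys.exit(1)

# Persist assume-role credentials to disk so they survive across processes
cred_chain.get_provider('assume-role').cache = JSONFileCache()

ec2 = session.resource('ec2')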

Automating pulling CSV files off Google Trends

pyGTrends does not seem to work; it gives errors in Python.
pyGoogleTrendsCsvDownloader seems to work and logs in, but after 1-3 requests (per day!) it complains about an exhausted quota, even though a manual download with the same login/IP works flawlessly.
Bottom line: neither works. Searching through Stack Overflow turns up many questions from people trying to pull CSVs from Google, but no workable solution that I could find...
Thank you in advance to whoever is able to help. How should the code be changed? Do you know of another solution that works?
Here is the code of pyGoogleTrendsCsvDownloader.py:
import httplib
import urllib
import urllib2
import re
import csv
import lxml.etree as etree
import lxml.html as html
import traceback
import gzip
import random
import time
import sys
from cookielib import Cookie, CookieJar
from StringIO import StringIO

class pyGoogleTrendsCsvDownloader(object):
    '''
    Google Trends Downloader

    Recommended usage:

    from pyGoogleTrendsCsvDownloader import pyGoogleTrendsCsvDownloader
    r = pyGoogleTrendsCsvDownloader(username, password)
    r.get_csv(cat='0-958', geo='US-ME-500')
    '''
    def __init__(self, username, password):
        '''
        Provide login and password to be used to connect to Google Trends
        All immutable system variables are also defined here
        '''
        # The amount of time (in secs) that the script should wait before making a request.
        # This can be used to throttle the downloading speed to avoid hitting servers too hard.
        # It is further randomized.
        self.download_delay = 0.25

        self.service = "trendspro"
        self.url_service = "http://www.google.com/trends/"
        self.url_download = self.url_service + "trendsReport?"

        self.login_params = {}
        # These headers are necessary, otherwise Google will flag the request at your account level
        self.headers = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'),
                        ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
                        ("Accept-Language", "en-gb,en;q=0.5"),
                        ("Accept-Encoding", "gzip, deflate"),
                        ("Connection", "keep-alive")]
        self.url_login = 'https://accounts.google.com/ServiceLogin?service='+self.service+'&passive=1209600&continue='+self.url_service+'&followup='+self.url_service
        self.url_authenticate = 'https://accounts.google.com/accounts/ServiceLoginAuth'
        self.header_dictionary = {}

        self._authenticate(username, password)

    def _authenticate(self, username, password):
        '''
        Authenticate to Google:
        1 - make a GET request to the Login webpage so we can get the login form
        2 - make a POST request with email, password and login form input values
        '''
        # Make sure we get CSV results in English
        ck = Cookie(version=0, name='I4SUserLocale', value='en_US', port=None, port_specified=False, domain='www.google.com', domain_specified=False, domain_initial_dot=False, path='/trends', path_specified=True, secure=False, expires=None, discard=False, comment=None, comment_url=None, rest=None)

        self.cj = CookieJar()
        self.cj.set_cookie(ck)
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
        self.opener.addheaders = self.headers

        # Get all of the login form input values
        find_inputs = etree.XPath("//form[@id='gaia_loginform']//input")
        try:
            resp = self.opener.open(self.url_login)
            if resp.info().get('Content-Encoding') == 'gzip':
                buf = StringIO(resp.read())
                f = gzip.GzipFile(fileobj=buf)
                data = f.read()
            else:
                data = resp.read()
            xmlTree = etree.fromstring(data, parser=html.HTMLParser(recover=True, remove_comments=True))
            for input in find_inputs(xmlTree):
                name = input.get('name')
                if name:
                    name = name.encode('utf8')
                    value = input.get('value', '').encode('utf8')
                    self.login_params[name] = value
        except:
            print("Exception while parsing: %s\n" % traceback.format_exc())

        self.login_params["Email"] = username
        self.login_params["Passwd"] = password

        params = urllib.urlencode(self.login_params)
        self.opener.open(self.url_authenticate, params)

    def get_csv(self, throttle=False, **kwargs):
        '''
        Download CSV reports
        '''
        # Randomized download delay
        if throttle:
            r = random.uniform(0.5 * self.download_delay, 1.5 * self.download_delay)
            time.sleep(r)

        params = {
            'export': 1
        }
        params.update(kwargs)
        params = urllib.urlencode(params)

        r = self.opener.open(self.url_download + params)

        # Make sure everything is working ;)
        if not r.info().has_key('Content-Disposition'):
            print "You've exceeded your quota. Continue tomorrow..."
            sys.exit(0)

        if r.info().get('Content-Encoding') == 'gzip':
            buf = StringIO(r.read())
            f = gzip.GzipFile(fileobj=buf)
            data = f.read()
        else:
            data = r.read()

        myFile = open('trends_%s.csv' % '_'.join(['%s-%s' % (key, value) for (key, value) in kwargs.items()]), 'w')
        myFile.write(data)
        myFile.close()
Although I don't know Python, I may have a solution. I am currently doing the same thing in C#, and though I didn't get the .csv file, I created a custom URL through code, then downloaded that HTML and saved it to a text file (also through code). In this HTML (at line 12) is all the information needed to create the graph that is used on Google Trends. However, it has a lot of unnecessary text that needs to be trimmed down. Either way, you end up with the same result: the Google Trends data. I posted a more detailed answer to my question here:
Downloading .csv file from Google Trends
There is an alternative module named pytrends (https://pypi.org/project/pytrends/). It is really cool; I would recommend it.
Example usage:
import numpy as np
import pandas as pd
from pytrends.request import TrendReq
pytrend = TrendReq()
#It is the term that you want to search
pytrend.build_payload(kw_list=["Eminem is the Rap God"])
# Find which region has searched the term
df = pytrend.interest_by_region()
df.to_csv("path\Eminem_InterestbyRegion.csv")
Potentially, if you have a list of terms to search, you could use a for loop to automate the insights as you wish, along the lines of the sketch below.
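A minimal sketch of that idea, assuming pytrends is installed; the keyword list and output directory are placeholders:

import os
from pytrends.request import TrendReq

pytrend = TrendReq()
keywords = ["Eminem", "Drake", "Kendrick Lamar"]  # placeholder list of search terms

out_dir = "trends_exports"
os.makedirs(out_dir, exist_ok=True)

for kw in keywords:
    # One payload per term keeps each CSV on its own 0-100 interest scale
    pytrend.build_payload(kw_list=[kw])
    df = pytrend.interest_by_region()
    df.to_csv(os.path.join(out_dir, "%s_InterestbyRegion.csv" % kw))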