Timestamps showing as 1970 in QuestDB when writing from Node.js

I have a table that looks like:
timestamp   sensor        reading
ts1         sensor name   123
The main part of my script where I'm writing data looks like the following:
const run = async () => {
  try {
    const client = new Client({
      database: "qdb",
      ...
    })
    await client.connect()

    // read sensors
    ...

    const insertData = await client.query(
      "INSERT INTO measurements VALUES($1, $2, $3);",
      [Date.now(), sensor_id, sensor_reading],
    )
    await client.query("COMMIT")
    await client.end()
  } catch (e) {
    console.log(e)
  }
}

run()
The sensor ID and measurements look fine, but the timestamps are all in 1970. What's wrong with the date insert?

QuestDB natively stores timestamps with microsecond resolution, while Date.now() returns a millisecond-resolution value, so the inserted number is interpreted as microseconds since the epoch and the rows end up in January 1970.
If you don't mind the loss of precision, you can just multiply by 1000:
const insertData = await client.query(
  "INSERT INTO measurements VALUES($1, $2, $3);",
  [Date.now() * 1000, sensor_id, sensor_reading],
)
If you have high throughput or need more accuracy, you can use the microtime package:
https://www.npmjs.com/package/microtime
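For example, a minimal sketch of the same insert using microtime (its now() call returns the current Unix time in microseconds; the table and parameters are the ones from the question):

const microtime = require("microtime")

// same insert as above, but with a genuine microsecond-resolution timestamp
const insertData = await client.query(
  "INSERT INTO measurements VALUES($1, $2, $3);",
  [microtime.now(), sensor_id, sensor_reading],
)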

Related

How likely is it to miss a Solidity event?

Hi, I am making a finance application that depends on events from the blockchain.
Basically, I update my database based on events I receive using web3.js, and when a user asks, I sign with the private key so the contract can pay the user.
My only concern is: can I depend on events? Could there be a case where I miss events?
Here is my code for doing so:
const contract = new web3.eth.Contract(abi, contract_address)
const stale_keccak256 = "0x507ac39eb33610191cd8fd54286e91c5cc464c262861643be3978f5a9f18ab02";
const unStake_keccak256 = "0x4ac743692c9ced0a3f0052fb9917c0856b6b12671016afe41b649643a89b1ad5";
const getReward_keccak256 = "0x25c30c62c42b51e4f667b70ef60f1f683c376f6ace28312ed45a40665e01af37";
let userRepository: Repository<UserData> = connection.getRepository(UserData);
let globalRepository: Repository<GlobalStakingInfo> = connection.getRepository(GlobalStakingInfo);
let userStakingRepository: Repository<UserStakingInfo> = connection.getRepository(UserStakingInfo);
let transactionRepository: Repository<Transaction> = connection.getRepository(Transaction);
const topics = []
web3.eth.subscribe('logs', {
address: contract_address, topics: topics
},
function (error: Error, result: Log) {
if (error) console.log(error)
}).on("data", async function (log: Log) {
let response: Boolean = false;
try {
response = await SaveTransaction(rpc_url, log.address, log.transactionHash, transactionRepository)
} catch (e) {
}
if (response) {
try {
let global_instance: GlobalStakingInfo | null = await globalRepository.findOne({where: {id: 1}})
if (!global_instance) {
global_instance = new GlobalStakingInfo()
global_instance.id = 1;
global_instance = await globalRepository.save(global_instance);
}
if (log.topics[0] === stale_keccak256) {
await onStake(web3, log, contract, userRepository, globalRepository, userStakingRepository, global_instance);
} else if (log.topics[0] === unStake_keccak256) {
await onUnStake(web3, log, contract, userStakingRepository, userRepository, globalRepository, global_instance)
} else if (log.topics[0] === getReward_keccak256) {
await onGetReward(web3, log, userRepository)
}
} catch (e) {
console.log("I MADE A BOBO", e)
}
}
}
)
The code works and everything; I am just concerned that I could miss an event, because this is finance-related and people will lose money if missing an event is possible.
Please advise.
You can increase redundancy by adding more instances of the listener connected to other nodes.
You can also poll past logs; again, it's recommended to use a separate node for that.
Having multiple instances doing practically the same thing will result in duplicated incoming data, so don't forget to store only unique logs. This can be a bit tricky, because theoretically one transaction ID can produce the same log twice (e.g. through a multicall), so the unique key should be a combination of the transaction ID and the log index (which is unique per block).
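As a rough sketch of the polling and de-duplication idea (assuming the same web3 instance and contract_address as in the question, and that the handlers from the question are reused):

// Poll a block range for past logs and de-duplicate on transaction hash + log index.
const seen = new Set() // in production, use a unique constraint on (transactionHash, logIndex) in your DB

async function pollPastLogs(fromBlock, toBlock) {
  const logs = await web3.eth.getPastLogs({
    address: contract_address,
    fromBlock,
    toBlock,
  })
  for (const log of logs) {
    const key = `${log.transactionHash}-${log.logIndex}`
    if (seen.has(key)) continue // already handled by the live subscription
    seen.add(key)
    // dispatch to the same onStake / onUnStake / onGetReward handlers as the subscription
  }
}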

Use X-Ray SDK in Lambda to send segments over UDP

I am using Lambda. I want to send a subsegment to X-Ray with a custom end_time. X-Ray is enabled in my Lambda.
When I use aws-xray-sdk-core and addNewSubsegment('postgres'), I don't see a way to set an end_time. It looks like the end_time is set when you close() the segment.
To try to work around this limitation, I based my code on the following approach to send a custom segment to the X-Ray daemon over UDP:
Use UDP to send segment to XRay
The code below is not sending a subsegment to X-Ray, and I am not receiving any errors when sending the segment with client.send(...).
Does anyone know more about this limitation of setting a custom end_time, or know whether it's possible over UDP inside a Lambda?
import AWSXRay from 'aws-xray-sdk-core'
import { createSocket } from 'dgram'

const traceDocument = {
  trace_id,
  id: generateRandomHex(16),
  name: 'postgres',
  start_time,
  end_time,
  type: 'subsegment',
  parent_id,
  sql: { sanitized_query: query },
}

const client = createSocket('udp4')

// The daemon expects a JSON header line followed by the serialized segment document.
const udpSegment = `{"format": "json", "version": 1}\n${JSON.stringify(traceDocument)}`

client.send(udpSegment, 2000, '127.0.0.1', (e, b) => {
  console.log(e, b)
})
Managed to find the solution myself.
I used the X-Ray SDK with a combination of addAttribute('in_progress', false) and streamSubsegments() to send the subsegments to X-Ray:
import AWSXRay from 'aws-xray-sdk-core'

export const onQueryEvent = async (e) => {
  try {
    const segment = AWSXRay.getSegment()
    if (segment) {
      const paramsArr = JSON.parse(e.params)
      const query = getQueryWithParams(e.query, paramsArr)
      // X-Ray wants the time in seconds -> ms * 1e-3
      const start_time = e.timestamp.valueOf() * 1e-3
      const end_time = (e.timestamp.valueOf() + e.duration) * 1e-3
      // Add a new subsegment to the parent segment
      const subSegment = segment.addNewSubsegment('postgres')
      // Add data to the subsegment
      subSegment.addSqlData({ sanitized_query: query })
      subSegment.addAttribute('start_time', start_time)
      subSegment.addAttribute('end_time', end_time)
      // Set in_progress to false so the subsegment
      // will be sent to X-Ray on streamSubsegments()
      subSegment.addAttribute('in_progress', false)
      subSegment.streamSubsegments()
    }
  } catch (e) {
    console.log(e)
  }
}

Trying to use the list method "fold" on a Future<List> to get the total price of items

In Dart/Flutter I am building a model where I get all products from a remote endpoint using a Future and async/await.
Once I have retrieved the list, I want to add a property that returns the total amount of all items using the "fold" method of Dart lists.
Something like this:
Future<List<Product>> get items async => // here I get the products
    await Future.wait(_itemIds.map((id) => api.getProduct(id)));

Future<int> get totalPrice async => // here I calculate the products' total amount
    await items.then((iii) => iii.fold(0, (total, current) async {
      return total + current.price;
    }));
But I get an error:
The operator '+' isn't defined for the class 'FutureOr'. Try
defining the operator '+'.dart(undefined_operator).
How am I supposed to solve this problem in an async language?
Thank you
The problem with your code is that await doesn't apply to items but to the whole expression items.then(...).
The following code should work:
Future<List<Product>> get items async => // here I get the products
    await Future.wait(_itemIds.map((id) => api.getProduct(id)));

Future<int> get totalPrice async { // here I calculate the products' total amount
  final products = await items;
  return products.fold<int>(0, (total, current) => total + current.price);
}

How to fix "SyntaxError: Unexpected token at position" when attempting to stream data to BigQuery?

I am trying to follow this codelab and I'm getting a SyntaxError when I get to step 7.
SyntaxError: Unexpected token ' in JSON at position 1 at JSON.parse (<anonymous>) at exports.subscribe (/srv/index.js:9:26) at /worker/worker.js:825:24 at <anonymous> at process._tickDomainCallback (internal/process/next_tick.js:229:7)
I tried to edit the const incomingData line of the code below and I am still getting the error.
exports.subscribe = function (event, callback) {
const BigQuery = require('@google-cloud/bigquery');
const projectId = "iot2analytics-240915"; //Enter your project ID here
const datasetId = "weatherData"; //Enter your BigQuery dataset name here
const tableId = "weatherDataTable"; //Enter your BigQuery table name here -- make sure it is setup correctly
const PubSubMessage = event.data;
// Incoming data is in JSON format
const incomingData = PubSubMessage.data ? Buffer.from(PubSubMessage.data, 'base64').toString() : "{'sensorID':'na','timecollected':'1/1/1970 00:00:00','zipcode':'00000','latitude':'0.0','longitude':'0.0','temperature':'-273','humidity':'-1','dewpoint':'-273','pressure':'0'}";
const jsonData = JSON.parse(incomingData);
var rows = [jsonData];
console.log(`Uploading data: ${JSON.stringify(rows)}`);
// Instantiates a client
const bigquery = BigQuery({
projectId: projectId
});
// Inserts data into a table
bigquery
.dataset(datasetId)
.table(tableId)
.insert(rows)
.then((foundErrors) => {
rows.forEach((row) => console.log('Inserted: ', row));
if (foundErrors && foundErrors.insertErrors != undefined) {
foundErrors.forEach((err) => {
console.log('Error: ', err);
})
}
})
.catch((err) => {
console.error('ERROR:', err);
});
// [END bigquery_insert_stream]
callback();
};
And here is the package.json:
{
  "name": "function-weatherPubSubToBQ-1",
  "version": "0.0.1",
  "private": true,
  "license": "Apache-2.0",
  "author": "Google Inc.",
  "dependencies": {
    "@google-cloud/bigquery": "^0.9.6"
  }
}
I can see on my Raspberry Pi that the data is being collected from the sensor, but I get the error every time it tries to insert into BigQuery.
Any suggestions or help would be greatly appreciated.
In your example, you coded the JSON as:
{
  'sensorID':'na',
  'timecollected':'1/1/1970 00:00:00',
  'zipcode':'00000',
  'latitude':'0.0',
  'longitude':'0.0',
  'temperature':'-273',
  'humidity':'-1',
  'dewpoint':'-273',
  'pressure':'0'
}
If we look at the JSON spec, we find that strings must be enclosed in double quotes, not single quotes. Replace your JSON with:
{
  "sensorID":"na",
  "timecollected":"1/1/1970 00:00:00",
  "zipcode":"00000",
  "latitude":"0.0",
  "longitude":"0.0",
  "temperature":"-273",
  "humidity":"-1",
  "dewpoint":"-273",
  "pressure":"0"
}
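For a quick check (for example in a Node.js REPL), the single-quoted string reproduces exactly the error from the question, while double-quoted JSON parses fine:

// single-quoted "JSON" is rejected by the parser
JSON.parse("{'sensorID':'na'}")
// -> SyntaxError: Unexpected token ' in JSON at position 1

// properly double-quoted JSON parses fine
JSON.parse('{"sensorID":"na"}')
// -> { sensorID: 'na' }

// on the publishing side, build the payload with JSON.stringify
// instead of hand-writing the string, so the quoting is always valid
JSON.stringify({ sensorID: 'na', temperature: '-273' })
// -> '{"sensorID":"na","temperature":"-273"}'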

How can I import bulk data from a CSV file into DynamoDB?

I am trying to import CSV data into AWS DynamoDB.
Here's what my CSV file looks like:
first_name last_name
sri ram
Rahul Dravid
JetPay Underwriter
Anil Kumar Gurram
In which language do you want to import the data? I just wrote a function in Node.js that can import a CSV file into a DynamoDB table. It first parses the whole CSV into an array, splits the array into chunks of 25, and then runs batchWriteItem on the table.
Note: DynamoDB only allows writing up to 25 records at a time in a batch insert, so we have to split our array into chunks.
var fs = require('fs');
var parse = require('csv-parse');
var async = require('async');
var AWS = require('aws-sdk');

var ddb = new AWS.DynamoDB(); // DynamoDB client used below
var csv_filename = "YOUR_CSV_FILENAME_WITH_ABSOLUTE_PATH";
rs = fs.createReadStream(csv_filename);
parser = parse({
columns : true,
delimiter : ','
}, function(err, data) {
var split_arrays = [], size = 25;
while (data.length > 0) {
split_arrays.push(data.splice(0, size));
}
data_imported = false;
chunk_no = 1;
async.each(split_arrays, function(item_data, callback) {
ddb.batchWriteItem({
"TABLE_NAME" : item_data
}, {}, function(err, res, cap) {
console.log('done going next');
if (err == null) {
console.log('Success chunk #' + chunk_no);
data_imported = true;
} else {
console.log(err);
console.log('Fail chunk #' + chunk_no);
data_imported = false;
}
chunk_no++;
callback();
});
}, function() {
// run after loops
console.log('all data imported....');
});
});
rs.pipe(parser);
Updated 2019 JavaScript code
I didn't have much luck with any of the JavaScript code samples above. Starting with Hassan Siddique's answer above, I've updated it to the latest API, included sample credential code, moved all user config to the top, added uuid()s when missing, and stripped out blank strings.
const fs = require('fs');
const parse = require('csv-parse');
const async = require('async');
const uuid = require('uuid/v4');
const AWS = require('aws-sdk');
// --- start user config ---
const AWS_CREDENTIALS_PROFILE = 'serverless-admin';
const CSV_FILENAME = "./majou.csv";
const DYNAMODB_REGION = 'eu-central-1';
const DYNAMODB_TABLENAME = 'entriesTable';
// --- end user config ---
const credentials = new AWS.SharedIniFileCredentials({
profile: AWS_CREDENTIALS_PROFILE
});
AWS.config.credentials = credentials;
const docClient = new AWS.DynamoDB.DocumentClient({
region: DYNAMODB_REGION
});
const rs = fs.createReadStream(CSV_FILENAME);
const parser = parse({
columns: true,
delimiter: ','
}, function(err, data) {
var split_arrays = [],
size = 25;
while (data.length > 0) {
split_arrays.push(data.splice(0, size));
}
data_imported = false;
chunk_no = 1;
async.each(split_arrays, function(item_data, callback) {
const params = {
RequestItems: {}
};
params.RequestItems[DYNAMODB_TABLENAME] = [];
item_data.forEach(item => {
for (const key of Object.keys(item)) {
// An AttributeValue may not contain an empty string
if (item[key] === '')
delete item[key];
}
params.RequestItems[DYNAMODB_TABLENAME].push({
PutRequest: {
Item: {
id: uuid(),
...item
}
}
});
});
docClient.batchWrite(params, function(err, res, cap) {
console.log('done going next');
if (err == null) {
console.log('Success chunk #' + chunk_no);
data_imported = true;
} else {
console.log(err);
console.log('Fail chunk #' + chunk_no);
data_imported = false;
}
chunk_no++;
callback();
});
}, function() {
// run after loops
console.log('all data imported....');
});
});
rs.pipe(parser);
I've created a gem for this.
Now you can install it by running gem install dynamocli, then you can use the command:
dynamocli import your_data.csv --to your_table
Here is the link to the source code: https://github.com/matheussilvasantos/dynamocli
As a lowly dev without permissions to create a Data Pipeline, I had to use this JavaScript. Hassan Siddique's code was slightly out of date, but this worked for me:
var fs = require('fs');
var parse = require('csv-parse');
var async = require('async');
const AWS = require('aws-sdk');
const dynamodbDocClient = new AWS.DynamoDB({ region: "eu-west-1" });
var csv_filename = "./CSV.csv";
rs = fs.createReadStream(csv_filename);
parser = parse({
columns : true,
delimiter : ','
}, function(err, data) {
var split_arrays = [], size = 25;
while (data.length > 0) {
//split_arrays.push(data.splice(0, size));
let cur25 = data.splice(0, size)
let item_data = []
for (var i = cur25.length - 1; i >= 0; i--) {
const this_item = {
"PutRequest" : {
"Item": {
// your column names here will vary, but you'll need to define the type
"Title": {
"S": cur25[i].Title
},
"Col2": {
"N": cur25[i].Col2
},
"Col3": {
"N": cur25[i].Col3
}
}
}
};
item_data.push(this_item)
}
split_arrays.push(item_data);
}
data_imported = false;
chunk_no = 1;
async.each(split_arrays, (item_data, callback) => {
const params = {
RequestItems: {
"tagPerformance" : item_data
}
}
dynamodbDocClient.batchWriteItem(params, function(err, res, cap) {
if (err === null) {
console.log('Success chunk #' + chunk_no);
data_imported = true;
} else {
console.log(err);
console.log('Fail chunk #' + chunk_no);
data_imported = false;
}
chunk_no++;
callback();
});
}, () => {
// run after loops
console.log('all data imported....');
});
});
rs.pipe(parser);
You can use AWS Data Pipeline, which is made for things like this. You can upload your CSV file to S3 and then use Data Pipeline to retrieve it and populate a DynamoDB table. They have a step-by-step tutorial.
I wrote a tool to do this using parallel execution that requires no dependencies or developer tooling installed on the machine (it's written in Go).
It can handle:
Comma separated (CSV) files
Tab separated (TSV) files
Large files
Local files
Files on S3
Parallel imports within AWS using AWS Step Functions to import > 4M rows per minute
No dependencies (no need for .NET, Python, Node.js, Docker, AWS CLI etc.)
It's available for MacOS, Linux, Windows and Docker: https://github.com/a-h/ddbimport
My test results show that it can import a lot faster in parallel using AWS Step Functions.
I'm describing the tool in more detail at AWS Community Summit on the 15th May 2020 at 1155 BST - https://www.twitch.tv/awscomsum
Before getting to my code, some notes on testing this locally
I recommend using a local version of DynamoDB, in case you want to sanity check this before you start incurring charges and what not. I made some small modifications before posting this, so be sure to test with whatever means make sense to you. There is a fake batch upload job I commented out, which you could use in lieu of any DynamoDB service, remote or local, to verify in stdout that this is working to your needs.
dynamodb-local
See dynamodb-local on npmjs or manual install
If you went the manual install route, you can start dynamodb-local with something like this:
java -Djava.library.path=<PATH_TO_DYNAMODB_LOCAL>/DynamoDBLocal_lib/\
-jar <PATH_TO_DYNAMODB_LOCAL>/DynamoDBLocal.jar\
-inMemory\
-sharedDb
The npm route may be simpler.
dynamodb-admin
Along with that, see dynamodb-admin.
I installed dynamodb-admin with npm i -g dynamodb-admin. It can then be run with:
dynamodb-admin
Using them:
dynamodb-local defaults to localhost:8000.
dynamodb-admin is a web page that defaults to localhost:8001. Once you launch these two services, open localhost:8001 in your browser to view and manipulate the database.
The script below doesn't create the database. Use dynamodb-admin for this.
Credit goes to...
Ben Nadel.
The code
I'm not as experienced with JS & Node.js as I am with other languages, so please forgive any JS faux pas.
You'll notice each group of concurrent batches is purposely slowed down by 900ms. This was a hacky solution, and I'm leaving it here to serve as an example (and because of laziness, and because you're not paying me).
If you increase MAX_CONCURRENT_BATCHES, you will want to calculate the appropriate delay amount based on your WCU, item size, batch size, and the new concurrency level.
Another approach would be to turn on Auto Scaling and implement exponential backoff for each failed batch. Like I mention below in one of the comments, this really shouldn't be necessary with some back-of-the-envelope calculations to figure out how many writes you can actually do, given your WCU limit and data size, and just let your code run at a predictable rate the entire time.
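For reference, a back-of-the-envelope version of that calculation could look like the following sketch (the WCU and item-size numbers are placeholders; MAX_RECORDS_PER_BATCH and MAX_CONCURRENT_BATCHES are the constants defined in the script below):

// 1 WCU covers one write per second for an item of up to 1 KB; item size rounds up to the next whole KB.
const WCU = 100                            // provisioned write capacity units (placeholder)
const ITEM_SIZE_KB = 0.5                   // average item size in KB (placeholder)
const wcuPerItem = Math.ceil(ITEM_SIZE_KB)
const writesPerSecond = WCU / wcuPerItem
const itemsPerCycle = MAX_RECORDS_PER_BATCH * MAX_CONCURRENT_BATCHES
const delayMs = Math.ceil((itemsPerCycle / writesPerSecond) * 1000)
// e.g. 25 items per cycle at 100 writes/second -> ~250 ms between cycles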
You might wonder why I didn't just let AWS SDK handle concurrency. Good question. Probably would have made this slightly simpler. You could experiment by applying the MAX_CONCURRENT_BATCHES to the maxSockets config option, and modifying the code that creates arrays of batches so that it only passes individual batches forward.
/**
* Uploads CSV data to DynamoDB.
*
* 1. Streams a CSV file line-by-line.
* 2. Parses each line to a JSON object.
* 3. Collects batches of JSON objects.
* 4. Converts batches into the PutRequest format needed by AWS.DynamoDB.batchWriteItem
* and runs 1 or more batches at a time.
*/
const AWS = require("aws-sdk")
const chalk = require('chalk')
const fs = require('fs')
const split = require('split2')
const uuid = require('uuid')
const through2 = require('through2')
const { Writable } = require('stream');
const { Transform } = require('stream');
const CSV_FILE_PATH = __dirname + "/../assets/whatever.csv"
// A whitelist of the CSV columns to ingest.
const CSV_KEYS = [
"id",
"name",
"city"
]
// Inadequate WCU will cause "insufficient throughput" exceptions, which in this script are not currently
// handled with retry attempts. Retries are not necessary as long as you consistently
// stay under the WCU, which isn't that hard to predict.
// The number of records to pass to AWS.DynamoDB.DocumentClient.batchWrite
// See https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
const MAX_RECORDS_PER_BATCH = 25
// The number of batches to upload concurrently.
// https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/node-configuring-maxsockets.html
const MAX_CONCURRENT_BATCHES = 1
// MAKE SURE TO LAUNCH `dynamodb-local` EXTERNALLY FIRST IF USING LOCALHOST!
AWS.config.update({
region: "us-west-1"
,endpoint: "http://localhost:8000" // Comment out to hit live DynamoDB service.
});
const db = new AWS.DynamoDB()
// Create a file line reader.
var fileReaderStream = fs.createReadStream(CSV_FILE_PATH)
var lineReaderStream = fileReaderStream.pipe(split())
var linesRead = 0
// Attach a stream that transforms text lines into JSON objects.
var skipHeader = true
var csvParserStream = lineReaderStream.pipe(
through2(
{
objectMode: true,
highWaterMark: 1
},
function handleWrite(chunk, encoding, callback) {
// ignore CSV header
if (skipHeader) {
skipHeader = false
callback()
return
}
linesRead++
// transform line into stringified JSON
const values = chunk.toString().split(',')
const ret = {}
CSV_KEYS.forEach((keyName, index) => {
ret[keyName] = values[index]
})
ret.line = linesRead
console.log(chalk.cyan.bold("csvParserStream:",
"line:", linesRead + ".",
chunk.length, "bytes.",
ret.id
))
callback(null, ret)
}
)
)
// Attach a stream that collects incoming json lines to create batches.
// Outputs an array (<= MAX_CONCURRENT_BATCHES) of arrays (<= MAX_RECORDS_PER_BATCH).
var batchingStream = (function batchObjectsIntoGroups(source) {
var batchBuffer = []
var idx = 0
var batchingStream = source.pipe(
through2.obj(
{
objectMode: true,
writableObjectMode: true,
highWaterMark: 1
},
function handleWrite(item, encoding, callback) {
var batchIdx = Math.floor(idx / MAX_RECORDS_PER_BATCH)
if (idx % MAX_RECORDS_PER_BATCH == 0 && batchIdx < MAX_CONCURRENT_BATCHES) {
batchBuffer.push([])
}
batchBuffer[batchIdx].push(item)
if (MAX_CONCURRENT_BATCHES == batchBuffer.length &&
MAX_RECORDS_PER_BATCH == batchBuffer[MAX_CONCURRENT_BATCHES-1].length)
{
this.push(batchBuffer)
batchBuffer = []
idx = 0
} else {
idx++
}
callback()
},
function handleFlush(callback) {
if (batchBuffer.length) {
this.push(batchBuffer)
}
callback()
}
)
)
return (batchingStream);
})(csvParserStream)
// Attach a stream that transforms batch buffers to collections of DynamoDB batchWrite jobs.
var databaseStream = new Writable({
objectMode: true,
highWaterMark: 1,
write(batchBuffer, encoding, callback) {
console.log(chalk.yellow(`Batch being processed.`))
// Create `batchBuffer.length` batchWrite jobs.
var jobs = batchBuffer.map(batch =>
buildBatchWriteJob(batch)
)
// Run multiple batch-write jobs concurrently.
Promise
.all(jobs)
.then(results => {
console.log(chalk.bold.red(`${batchBuffer.length} batches completed.`))
})
.catch(error => {
console.log( chalk.red( "ERROR" ), error )
callback(error)
})
.then( () => {
console.log( chalk.bold.red("Resuming file input.") )
setTimeout(callback, 900) // slow down the uploads. calculate this based on WCU, item size, batch size, and concurrency level.
})
// return false
}
})
batchingStream.pipe(databaseStream)
// Builds a batch-write job that runs as an async promise.
function buildBatchWriteJob(batch) {
let params = buildRequestParams(batch)
// This was being used temporarily prior to hooking up the script to any dynamo service.
// let fakeJob = new Promise( (resolve, reject) => {
// console.log(chalk.green.bold( "Would upload batch:",
// pluckValues(batch, "line")
// ))
// let t0 = new Date().getTime()
// // fake timing
// setTimeout(function() {
// console.log(chalk.dim.yellow.italic(`Batch upload time: ${new Date().getTime() - t0}ms`))
// resolve()
// }, 300)
// })
// return fakeJob
let promise = new Promise(
function(resolve, reject) {
let t0 = new Date().getTime()
let printItems = function(msg, items) {
console.log(chalk.green.bold(msg, pluckValues(batch, "id")))
}
let processItemsCallback = function (err, data) {
if (err) {
console.error(`Failed at batch: ${pluckValues(batch, "line")}, ${pluckValues(batch, "id")}`)
console.error("Error:", err)
reject()
} else {
var params = {}
params.RequestItems = data.UnprocessedItems
var numUnprocessed = Object.keys(params.RequestItems).length
if (numUnprocessed != 0) {
console.log(`Encountered ${numUnprocessed} unprocessed items`)
printItems("Retrying unprocessed items:", params)
db.batchWriteItem(params, processItemsCallback)
} else {
console.log(chalk.dim.yellow.italic(`Batch upload time: ${new Date().getTime() - t0}ms`))
resolve()
}
}
}
db.batchWriteItem(params, processItemsCallback)
}
)
return (promise)
}
// Build request payload for the batchWrite
function buildRequestParams(batch) {
var params = {
RequestItems: {}
}
params.RequestItems.Provider = batch.map(obj => {
let item = {}
CSV_KEYS.forEach((keyName, index) => {
if (obj[keyName] && obj[keyName].length > 0) {
item[keyName] = { "S": obj[keyName] }
}
})
return {
PutRequest: {
Item: item
}
}
})
return params
}
function pluckValues(batch, fieldName) {
var values = batch.map(item => {
return (item[fieldName])
})
return (values)
}
Here's my solution. I relied on the fact that there was some type of header indicating which column did what. Simple and straightforward. No pipeline nonsense for a quick upload.
import os, json, csv, yaml, time
from tqdm import tqdm

# For Database
import boto3

# Variable store
environment = {}

# Environment variables
with open("../env.yml", 'r') as stream:
    try:
        environment = yaml.load(stream)
    except yaml.YAMLError as exc:
        print(exc)

# Get the service resource.
dynamodb = boto3.resource('dynamodb',
    aws_access_key_id=environment['AWS_ACCESS_KEY'],
    aws_secret_access_key=environment['AWS_SECRET_KEY'],
    region_name=environment['AWS_REGION_NAME'])

# Instantiate a table resource object without actually
# creating a DynamoDB table. Note that the attributes of this table
# are lazy-loaded: a request is not made nor are the attribute
# values populated until the attributes
# on the table resource are accessed or its load() method is called.
table = dynamodb.Table('data')

# Header
header = []

# Open CSV
with open('export.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')

    # Parse each line
    with table.batch_writer() as batch:
        for index, row in enumerate(tqdm(reader)):
            if index == 0:
                # save the header to be used as the keys
                header = row
            else:
                if not row:
                    continue

                # Create JSON object
                # Push to DynamoDB
                data = {}

                # Iterate over each column
                for index, entry in enumerate(header):
                    data[entry.lower()] = row[index]

                response = batch.put_item(
                    Item=data
                )

                # Repeat
Another quick workaround is to load your CSV into RDS or any other MySQL instance first, which is quite easy to do (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html), and then use DMS (AWS Database Migration Service) to load the entire dataset into DynamoDB. You'll have to create a role for DMS before you can load the data, but this works wonderfully without having to run any scripts.
I used https://github.com/GorillaStack/dynamodb-csv-export-import. It is super simple and worked like a charm. I just followed the instructions in the README:
# Install globally
npm i -g @gorillastack/dynamodb-csv-export-import
# Set AWS region
export AWS_DEFAULT_REGION=us-east-1
# Use it for your CSV and dynamo table
dynamodb-csv-export-import my-exported-file.csv MyDynamoDbTableName
Here's a simpler solution. And with this solution, you don't have to remove empty string attributes.
require('./env'); //contains aws secret/access key
const parse = require('csvtojson');
const AWS = require('aws-sdk');
// --- start user config ---
const CSV_FILENAME = __dirname + "/002_subscribers_copy_from_db.csv";
const DYNAMODB_TABLENAME = '002-Subscribers';
// --- end user config ---
//You could add your credentials here, or you could
//store them in process.env like I have done; the aws-sdk
//will detect the keys in the environment
AWS.config.update({
region: process.env.AWS_REGION
});
const db = new AWS.DynamoDB.DocumentClient({
convertEmptyValues: true
});
(async ()=>{
const json = await parse().fromFile(CSV_FILENAME);
//this is efficient enough if you're processing small
//amounts of data. If your data set is large then I
//suggest using dynamodb method .batchWrite() and send
//in data in chunks of 25 (the limit) and find yourself
//a more efficient loop if there is one
for(var i=0; i<json.length; i++){
console.log(`processing item number ${i+1}`);
let query = {
TableName: DYNAMODB_TABLENAME,
Item: json[i]
};
await db.put(query).promise();
/**
* Note: If "json" contains other nested objects, you would have to
* loop through the json and parse all child objects.
* likewise, you would have to convert all children into their
* native primitive types because everything would be represented
* as a string.
*/
}
console.log('\nDone.');
})();
One way of importing/exporting stuff:
"""
Batch-writes data from a file to a dynamo-db database.
"""
import json
import boto3
# Get items from DynamoDB table like this:
# aws dynamodb scan --table-name <table-name>
# Create dynamodb client.
client = boto3.client(
'dynamodb',
aws_access_key_id='',
aws_secret_access_key=''
)
with open('', 'r') as file:
data = json.loads(file.read())['Items']
# Execute write-data request for each item.
for item in data:
client.put_item(
TableName='',
Item=item
)
The simplest solution is probably to use a template / solution made by AWS:
Implementing bulk CSV ingestion to Amazon DynamoDB
https://aws.amazon.com/blogs/database/implementing-bulk-csv-ingestion-to-amazon-dynamodb/
With this approach, you use the template provided to create a CloudFormation stack including an S3 bucket, a Lambda function, and a new DynamoDB table. The lambda is triggered to run on upload to the S3 bucket and inserts into the table in batches.
In my case, I wanted to insert into an existing table, so I just changed the Lambda function's environment variable once the stack was created.
Follow the instructions at the following link to import data into existing tables in DynamoDB:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SampleData.LoadData.html
Please note, the table name is what you must find here:
https://console.aws.amazon.com/dynamodbv2/home
The table name is used inside the JSON file; the name of the JSON file itself is not important. For example, I have a table called Country-kdezpod7qrap7nhpjghjj-staging, so to import data into that table I must make a JSON file like this:
{
  "Country-kdezpod7qrap7nhpjghjj-staging": [
    {
      "PutRequest": {
        "Item": {
          "id": {
            "S": "ir"
          },
          "__typename": {
            "S": "Country"
          },
          "createdAt": {
            "S": "2021-01-04T12:32:09.012Z"
          },
          "name": {
            "S": "Iran"
          },
          "self": {
            "N": "1"
          },
          "updatedAt": {
            "S": "2021-01-04T12:32:09.012Z"
          }
        }
      }
    }
  ]
}
If you don't know how to create the items for each PutRequest, you can create an item in your DB with a mutation and then try to duplicate it; that will show you the structure of one item.
If you have a huge list of items in your CSV file, you can use the following npm tool to generate the json file:
https://www.npmjs.com/package/json-dynamo-putrequest
Then we can use the following command to import the data:
aws dynamodb batch-write-item --request-items file://Country.json
If it imports the data successfully, you should see the following output:
{
  "UnprocessedItems": {}
}
Also please note that with this method you can only have 25 PutRequest items in each array, so if you want to push 100 items you need to create 4 files.
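If you'd rather not split the file by hand, a small Node.js sketch along these lines could generate the chunked files (Country.json and the table name are the examples from above; the part-file naming is arbitrary):

const fs = require('fs')

const TABLE_NAME = 'Country-kdezpod7qrap7nhpjghjj-staging'
const requests = JSON.parse(fs.readFileSync('Country.json', 'utf8'))[TABLE_NAME]

// batch-write-item accepts at most 25 PutRequest items per call,
// so write one request-items file per chunk of 25
for (let i = 0; i < requests.length; i += 25) {
  const chunk = { [TABLE_NAME]: requests.slice(i, i + 25) }
  fs.writeFileSync(`Country-part-${i / 25 + 1}.json`, JSON.stringify(chunk, null, 2))
}
// then run: aws dynamodb batch-write-item --request-items file://Country-part-1.json (and so on)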
You can try using batch writes and multiprocessing to speed up your bulk import.
import csv
import time
import boto3
from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(4)

current_milli_time = lambda: int(round(time.time() * 1000))
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('table_name')

def add_users_in_batch(data):
    with table.batch_writer() as batch:
        for item in data:
            batch.put_item(Item=item)

def run_batch_migration():
    start = current_milli_time()
    row_count = 0
    batch = []
    batches = []
    with open(CSV_PATH, newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter='\t', quotechar='|')
        for row in reader:
            row_count += 1
            item = {
                'email': row[0],
                'country': row[1]
            }
            batch.append(item)
            if row_count % 25 == 0:
                batches.append(batch)
                batch = []
        batches.append(batch)
        pool.map(add_users_in_batch, batches)
    print('Number of rows processed - ', str(row_count))
    end = current_milli_time()
    print('Total time taken for migration : ', str((end - start) / 1000), ' secs')

if __name__ == "__main__":
    run_batch_migration()
Try this. It is simple and helpful.
You can now natively bulk import into DynamoDB in CSV, DynamoDB JSON or Amazon Ion formats. This requires your data to be present in an S3 bucket. No code required.
blog - https://aws.amazon.com/blogs/database/amazon-dynamodb-can-now-import-amazon-s3-data-into-a-new-table/
docs - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataImport.HowItWorks.html
Key considerations while using this native feature, particularly for CSV data:
You can specify the table's Partition Key (PK)/Sort Key (SK) and their data types, plus all other CreateTable parameters
The feature currently supports only importing into a new table each time
Data with the same PK and SK will be overwritten (similar to a PutItem operation)
Except for the PK and SK, all other fields in the CSV will be treated as DynamoDB Strings. If this is not desirable, you can convert the data into DynamoDB JSON/Amazon Ion format with explicit data types before importing
Any Global Secondary Indexes created as part of the ImportTable operation will be populated free of cost. Import cost depends on the uncompressed source data size
GSIs created at import time will also map data types as per the source data; all non-key attributes will still, however, be treated as DynamoDB Strings
ImportTable consumes no write capacity on the table, so you could create the table with 1 WCU and the import performance will be the same as an ImportTable performed for a table with 100K WCU