elasticsearch bulk insert from s3 using lambda and python - amazon-web-services

I have a task to delete all docs from an Elasticsearch index and repopulate it from all the files in an S3 bucket. I can do this one document at a time, but it is extremely slow, and my attempt at a bulk insert is just as slow and is eluding me. What am I missing? I feel that I am negating the speed benefit of the paginator, and I am also getting a MemoryError at the end when I try to gather 1000 documents at a time and bulk insert them into Elasticsearch. If a bulk insert is possible, how do I do it when I am building the payload from scratch and holding it in memory? I want it to be as fast as the paginator, but I am failing miserably here. What am I missing, please?
I have put the data examples in the code as comments (lines 15, 38, 47-48)
1 s3_bucket='s3_bucket_name'
2 s3 = boto3.resource('s3')
3 paginator = s3_client.get_paginator("list_objects_v2")
4 es_url='https://my_es_url'
5 aws_auth=my_auth_info
6 es_client=Elasticsearch([es_url], .....)
7
8 # get keys from s3 bucket
9 for page in paginator.paginate(Bucket=s3_bucket, Prefix=''):
10     if page['KeyCount'] <= 0 or "Contents" not in page:
11         continue
12     keys = [obj["Key"] for obj in page["Contents"]]
13     if not keys:
14         continue  # nothing to see here
15     objects = [{"Key": key} for key in keys]  # [{'Key': 'path/filename1_02032022T000007Z.json'}, etc...]
16     # paginator maxes out at 1000, so objects holds at most 1000 keys per page (total count > 100,000)
17     count = len(objects)
18     total += count
19     if dryrun:
20         print("dryrun: {0} objects".format(count))  # 1000
21         print("{0} total objects".format(total))  # 102701 total objects
22     else:
23         es_index = 'es_index_name'
24         # delete docs from ES index
25         es_client.delete_by_query(index=es_index, body={"query": {"match_all": {}}})
26
27         # copy from s3 to ES  ## really slow after this because getting data one at a time
28         actions = []  # list
29         index = {}  # dict
30         index['_index'] = es_index
31         doc = {}
32         for obj in objects:
33             key_id = obj['Key']
34             id = key_id.partition('.json')[0]
35             index['_id'] = id  # {'_index': 'es_index_name', '_id': 'filename'}
36             # get object body
37             data_obj = s3.Object(s3_bucket, key_id)
38             data_json = data_obj.get()  # {'data1':'value1','data2':'value2','data3':'value3'}
39             data_json = data_json.get('Body').read().decode('utf-8')
40             data_json = data_json.replace("'", '"')
41             doc = json.loads(data_json)
42             doc['id'] = id
43             actions.append(index)
44             actions += str(actions) + '\n'
45             actions += str(doc) + '\n'  # GETTING MEMORYERROR HERE
46             # actions:
47             # [{'_index':'es_index_name','_id':'filename1_02032022T000007Z'},
48             #  {'data1':'value1','data2':'value2','data3':'value3'}]
49         print('--------------------------------')
50         response = helpers.bulk(es_client, actions)
51         print('Data successfully inserted into ES')
52         print("PROCESS_ALL: {0} objects".format(count))  # 1000
53         print("{0} total objects".format(total))  # 102701 total objects
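For comparison, here is a minimal sketch of the generator-fed approach the elasticsearch helpers are designed for, so that only one document and one bulk chunk are ever held in memory. It reuses the placeholder names from the question (bucket, index, URL), elides the auth setup the same way, and assumes each S3 file contains a single valid JSON document.

import json

import boto3
from elasticsearch import Elasticsearch, helpers

s3_bucket = 's3_bucket_name'
es_index = 'es_index_name'
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
es_client = Elasticsearch(['https://my_es_url'])  # auth arguments omitted, as in the question

def generate_actions():
    """Yield one bulk action per S3 object instead of accumulating them all in a list."""
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=s3_bucket, Prefix=''):
        for obj in page.get('Contents', []):
            key = obj['Key']
            body = s3.Object(s3_bucket, key).get()['Body'].read().decode('utf-8')
            doc = json.loads(body)               # assumes each file is one valid JSON document
            doc_id = key.partition('.json')[0]
            doc['id'] = doc_id
            yield {'_index': es_index, '_id': doc_id, '_source': doc}

es_client.delete_by_query(index=es_index, body={"query": {"match_all": {}}})

# streaming_bulk consumes the generator and ships documents in chunks,
# so the full 100,000+ document payload never sits in memory at once
for ok, item in helpers.streaming_bulk(es_client, generate_actions(), chunk_size=1000):
    if not ok:
        print(item)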

Related

Severe memory leak with Django

I am facing a huge memory leak on a server serving a Django (1.8) app with Apache or Nginx (the issue happens with both).
When I go to certain pages (let's say the specific request below), the RAM of the server climbs to 16 GB in a few seconds (with only one request) and the server freezes.
def records(request):
    """Return the page listing the 14 most recent records."""
    values = []
    time = timezone.now() - timedelta(days=14)
    record = Records.objects.filter(time__gte=time)
    return render(request,
                  'record_app/records_newests.html',
                  {
                      'active_nav_tab': ["active", "", "", ""],
                      'record': record,
                  })
When I git checkout an older version, from back when there was no such problem, the problem survives and I have the same issue.
I did a memory check with Guppy for the faulty request; here is the result:
>>> hp.heap()
Partition of a set of 7042 objects. Total size = 8588675016 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1107 16 8587374512 100 8587374512 100 unicode
1 1014 14 258256 0 8587632768 100 django.utils.safestring.SafeText
2 45 1 150840 0 8587783608 100 dict of 0x390f0c0
3 281 4 78680 0 8587862288 100 dict of django.db.models.base.ModelState
4 326 5 75824 0 8587938112 100 list
5 47 1 49256 0 8587987368 100 dict of 0x38caad0
6 47 1 49256 0 8588036624 100 dict of 0x39ae590
7 46 1 48208 0 8588084832 100 dict of 0x3858ab0
8 46 1 48208 0 8588133040 100 dict of 0x38b8450
9 46 1 48208 0 8588181248 100 dict of 0x3973fe0
<164 more rows. Type e.g. '_.more' to view.>
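For reference, a heap dump like the one above usually comes from a setup along these lines (assuming the Guppy/heapy package; the variable names are placeholders):

from guppy import hpy

hp = hpy()
hp.setrelheap()        # optional: only count objects allocated after this point
# ... exercise the faulty request here ...
print(hp.heap())       # prints a partition of live objects by type, as shown above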
After a day of searching I found my answer.
While investigating I checked statistics on my DB and saw that one table was 800 MB in size yet had only 900 rows. This table contains a TextField without a max length. Somehow a huge amount of data had been inserted into one text field, and this row was slowing everything down on every page using this model.
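Not part of the original answer, but a quick way to confirm this kind of culprit is to sort the rows by the length of the unbounded TextField; a sketch with hypothetical model and field names (BigTable, notes):

from django.db.models.functions import Length

# hypothetical model/field names; point these at the table with the unbounded TextField
suspects = (BigTable.objects
            .annotate(notes_len=Length('notes'))
            .order_by('-notes_len')
            .values('id', 'notes_len')[:10])
for row in suspects:
    print(row['id'], row['notes_len'])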

Eliminate duplicate pairs of numbers and total their quantities in openoffice Calc

I have a Calc sheet listing a cut-list for plywood in two columns with a quantity in a third column. I would like to remove duplicate matching pairs of dimensions and total the quantity. Starting with:
A B C
25 35 2
25 40 1
25 45 3
25 45 2
35 45 1
35 50 3
40 25 1
40 25 1
Ending with:
A B C
25 35 2
25 40 1
25 45 5
35 45 1
35 50 3
40 25 2
I'm trying to automate this. Currently I have multiple lists which occupy the same page which need to be totaled independently of each other.
Put a unique ListId, ListCode or ListNumber on each of the lists. Let all rows falling into the same list have the same value for this field.
Concatenate A & B to form a new column, say, PairAB.
If the list is small and manageable, filter on PairAB and collect the totals.
Otherwise, use grouping and subtotals to get totals for each list and each pair, grouping on ListId and PairAB.
If the list is very large, you are better off taking it to CSV and onward to a database; such things are simple child's play in SQL.
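The same pair-grouping idea, sketched in Python for the CSV route mentioned above (the file name cutlist.csv and the A, B, C column order are assumptions, not from the original question):

import csv
from collections import OrderedDict

totals = OrderedDict()                     # keeps the first-seen order of pairs
with open('cutlist.csv', newline='') as f:
    for a, b, qty in csv.reader(f):
        pair = (a, b)                      # the "PairAB" key described above
        totals[pair] = totals.get(pair, 0) + int(qty)

for (a, b), qty in totals.items():
    print(a, b, qty)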

Python Pandas read_csv issue

I have simple CSV file that looks like this:
inches,12,3,56,80,45
tempF,60,45,32,80,52
I read in the CSV using this command:
import pandas as pd
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0)
Which results in this structure:
1 2 3 4 5
0
inches 12 3 56 80 45
tempF 60 45 32 80 52
But I want this (unnamed index column):
0 1 2 3 4
inches 12 3 56 80 45
tempF 60 45 32 80 52
EDIT: As #joris pointed out, additional methods can be run on the resulting DataFrame to achieve the wanted structure. My question is specifically about whether or not this structure can be achieved through read_csv arguments.
from the documentation of the function:
names : array-like
List of column names to use. If file contains no header row, then you
should explicitly pass header=None
so, apparently:
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0, names=range(5))
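If the post-read route mentioned in the EDIT is acceptable, a small sketch of that alternative (same file, relabelling after the original read_csv call):

import pandas as pd

pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0)
pd_obj.columns = range(pd_obj.shape[1])   # relabel the columns 0..4
pd_obj.index.name = None                  # drop the "0" label on the index
print(pd_obj)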

Read multiple *.txt files into Pandas Dataframe with filename as column header

I am trying to import a set of *.txt files. I need to import the files into successive columns of a Pandas DataFrame in Python.
Requirements and Background information:
Each file has one column of numbers
No headers are present in the files
Positive and negative integers are possible
The size of all the *.txt files is the same
The columns of the DataFrame must have the name of file (without extension) as the header
The number of files is not known ahead of time
Here is one sample *.txt file. All the others have the same format.
16
54
-314
1
15
4
153
86
4
64
373
3
434
31
93
53
873
43
11
533
46
Here is my attempt:
import pandas as pd
import os
import glob

# Step 1: get a list of all *.txt files in target directory
my_dir = "C:\\Python27\Files\\"
filelist = []
filesList = []
os.chdir(my_dir)

# Step 2: Build up list of files:
for files in glob.glob("*.txt"):
    fileName, fileExtension = os.path.splitext(files)
    filelist.append(fileName)  # filename without extension
    filesList.append(files)  # filename with extension

# Step 3: Build up DataFrame:
df = pd.DataFrame()
for ijk in filelist:
    frame = pd.read_csv(filesList[ijk])
    df = df.append(frame)
print df
Steps 1 and 2 work. I am having problems with step 3. I get the following error message:
Traceback (most recent call last):
File "C:\Python27\TextFile.py", line 26, in <module>
frame = pd.read_csv(filesList[ijk])
TypeError: list indices must be integers, not str
Question:
Is there a better way to load these *.txt files into a Pandas dataframe? Why does read_csv not accept strings for file names?
You can read them into multiple dataframes and concat them together afterwards. Suppose you have two of those files, containing the data shown.
In [6]:
filelist = ['val1.txt', 'val2.txt']
print pd.concat([pd.read_csv(item, names=[item[:-4]]) for item in filelist], axis=1)
val1 val2
0 16 16
1 54 54
2 -314 -314
3 1 1
4 15 15
5 4 4
6 153 153
7 86 86
8 4 4
9 64 64
10 373 373
11 3 3
12 434 434
13 31 31
14 93 93
15 53 53
16 873 873
17 43 43
18 11 11
19 533 533
20 46 46
You're very close. The loop variable is already a filename, so you don't need to index the list; just iterate over filesList (which keeps the extension):
# Step 3: Build up DataFrame:
df = pd.DataFrame()
for ijk in filesList:
    frame = pd.read_csv(ijk)
    df = df.append(frame)
print df
In the future, please provide working code exactly as is. You wrote from pandas import * yet then refer to pandas as pd, implying your actual import is import pandas as pd.
You also want to be careful with variable names: files is actually a single file path, and filelist and filesList cannot be told apart from their names alone. It also seems like a bad idea to keep personal documents in your Python directory.
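For completeness, a sketch (not from either answer) combining the two suggestions under the question's assumptions: build the file list with glob and give each column the file's name, without its extension, as a header.

import glob
import os
import pandas as pd

my_dir = "C:\\Python27\\Files\\"           # assumed location, as in the question
paths = glob.glob(os.path.join(my_dir, "*.txt"))

frames = [pd.read_csv(path, header=None,
                      names=[os.path.splitext(os.path.basename(path))[0]])
          for path in paths]
df = pd.concat(frames, axis=1)             # one column per file, headed by its filename
print(df)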

Merge Sort Array of Integers

Can anyone help me with this practice question: 33 31 11 47 2 20 24 12 2 43. I am trying to figure out what the contents of the two output lists would be after the first pass of the Merge Sort.
The answer is supposedly:
List 1: 33 11 47 12
List 2: 31 2 20 24 2 43
That is not making any sense to me, since I was under the impression that the first pass was where it divided the list into two at the middle....
33 31 11 47 2 20 24 12
At first the list is divided into individual elements; once singleton lists are formed, each element is merged with the one next to it. So after the first pass we have
31 33 11 47 2 20 12 24
After that
11 31 33 47 2 12 20 24
and then
2 11 12 20 24 31 33 47
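Not part of the original answer: a short Python sketch of the bottom-up merging described above, printing the list after each pass so the intermediate states can be reproduced.

def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]

def bottom_up_mergesort(items):
    items = list(items)
    width = 1                                   # current run size: 1, 2, 4, ...
    while width < len(items):
        merged = []
        for start in range(0, len(items), 2 * width):
            merged += merge(items[start:start + width],
                            items[start + width:start + 2 * width])
        items = merged
        print("after pass with run size {0}: {1}".format(width, items))
        width *= 2
    return items

bottom_up_mergesort([33, 31, 11, 47, 2, 20, 24, 12])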