I'm trying to write a program to process the output of some computations for me and my group of chemists. I need to be able to search through an output file that contains in excess of 100,000 lines of repeating units. I'm having trouble developing an expression to pull out the data that I need. Here is a sample portion of the output file from which I need to extract data.
--------------------------------------------------------
6 0.934039 1.910373 -0.007356
6 0.522681 1.025902 1.132490
6 1.295895 0.175184 1.887754
6 2.745917 -0.059133 1.663889
6 3.251755 -1.317777 1.620723
6 2.354786 -2.503163 1.659226
6 2.500165 -3.540741 2.548878
16 1.200827 -4.674482 2.398451
6 0.450003 -3.761272 1.124048
6 1.191704 -2.653037 0.838883
1 0.889391 -1.914706 0.104995
6 -0.831020 -4.156701 0.527821
6 -1.110322 -3.881221 -0.814127
6 -2.349055 -4.235426 -1.332810
7 -3.309523 -4.841208 -0.630339
6 -3.035523 -5.105663 0.648864
6 -1.833017 -4.789011 1.269474
1 -1.692631 -5.008941 2.323469
1 -3.826405 -5.592591 1.215268
1 -2.579432 -4.028326 -2.375778
1 -0.365461 -3.412291 -1.448955
6 3.573349 -3.754217 3.574993
1 3.816975 -2.812111 4.074804
1 3.260231 -4.473156 4.337010
1 4.494421 -4.124194 3.111155
6 4.710061 -1.617156 1.557050
6 5.623546 -1.003416 2.421004
6 6.974181 -1.326012 2.370518
6 7.436595 -2.269579 1.456291
6 6.535520 -2.894414 0.598497
6 5.182590 -2.576469 0.655016
1 4.479434 -3.078256 -0.004163
1 6.885058 -3.634965 -0.114866
1 8.492504 -2.519869 1.416345
1 7.667827 -0.841306 3.051107
1 5.267975 -0.268736 3.136940
6 3.580944 1.164538 1.503126
6 4.497172 1.300372 0.454818
6 5.240591 2.465784 0.310519
6 5.081537 3.515835 1.211519
6 4.166656 3.395074 2.253939
6 3.418260 2.231401 2.393481
1 2.692677 2.145384 3.197718
1 4.033033 4.209217 2.960243
1 5.661476 4.426693 1.096865
1 5.945953 2.554610 -0.510417
1 4.622653 0.485776 -0.251950
6 0.528985 -0.506746 2.884231
6 -0.800876 -0.209971 2.864388
16 -1.134679 0.961768 1.625169
6 -1.858335 -0.812159 3.682689
6 -1.740268 -2.136749 4.118466
6 -2.752842 -2.684280 4.894687
7 -3.862544 -2.030644 5.248979
6 -3.974818 -0.772056 4.819134
6 -3.016393 -0.122197 4.049880
1 -3.167697 0.912109 3.755521
1 -4.882415 -0.247888 5.110965
1 -2.672356 -3.711555 5.245040
1 -0.881625 -2.737495 3.833528
1 0.964846 -1.208304 3.585910
1 0.076921 2.185394 -0.628553
1 1.403779 2.830811 0.357050
1 1.665791 1.402320 -0.641683
Energy (Hartree) = -2216.64927779
Step: 34
Scan 1 out of 73
Converged(Max Force, RMS Force, Max Disp, RMS Disp): YES, YES, NO, YES
--------------------------------------------------------
--------------------------------------------------------
6 0.934062 1.911021 -0.006793
6 0.522693 1.026243 1.132808
6 1.295849 0.175204 1.887786
6 2.745849 -0.059151 1.663825
6 3.251670 -1.317799 1.620645
6 2.354637 -2.503134 1.659254
6 2.499913 -3.540512 2.549163
16 1.200720 -4.674404 2.398679
6 0.450032 -3.761466 1.124001
6 1.191706 -2.653228 0.838749
1 0.889499 -1.915053 0.104660
6 -0.830799 -4.157197 0.527556
6 -1.109977 -3.881704 -0.814420
6 -2.348529 -4.236224 -1.333319
7 -3.308924 -4.842349 -0.631043
6 -3.035027 -5.106853 0.648170
6 -1.832715 -4.789884 1.268992
1 -1.692398 -5.009943 2.322969
1 -3.825839 -5.594087 1.214409
1 -2.578819 -4.029095 -2.376300
1 -0.365153 -3.412510 -1.449097
6 3.572924 -3.753599 3.575540
1 3.816308 -2.811319 4.075158
1 3.259759 -4.472389 4.337671
1 4.494142 -4.123544 3.111975
6 4.709955 -1.617231 1.556911
6 5.623512 -1.003434 2.420751
6 6.974133 -1.326076 2.370222
6 7.436469 -2.269740 1.456057
6 6.535326 -2.894639 0.598379
6 5.182409 -2.576651 0.654945
1 4.479201 -3.078478 -0.004148
1 6.884808 -3.635261 -0.114937
1 8.492367 -2.520070 1.416076
1 7.667829 -0.841314 3.050721
1 5.268008 -0.268672 3.136635
6 3.580909 1.164495 1.503041
6 4.497111 1.300286 0.454710
6 5.240549 2.465680 0.310363
6 5.081547 3.515754 1.211345
6 4.166716 3.395021 2.253815
6 3.418302 2.231369 2.393405
1 2.692762 2.145371 3.197682
1 4.033146 4.209176 2.960116
1 5.661485 4.426605 1.096636
1 5.945883 2.554473 -0.510601
1 4.622556 0.485666 -0.252039
6 0.528927 -0.506827 2.884183
6 -0.800849 -0.209681 2.864659
16 -1.134600 0.962406 1.625752
6 -1.858353 -0.811785 3.682977
6 -1.740809 -2.136677 4.117983
6 -2.753416 -2.684132 4.894214
7 -3.862669 -2.030122 5.249222
6 -3.974426 -0.771225 4.820153
6 -3.015920 -0.121422 4.050946
1 -3.166775 0.913141 3.757259
1 -4.881643 -0.246746 5.112609
1 -2.673367 -3.711662 5.243914
1 -0.882533 -2.737672 3.832462
1 0.964690 -1.208688 3.585621
1 0.076974 2.186246 -0.627940
1 1.403854 2.831315 0.357930
1 1.665819 1.403131 -0.641235
Energy (Hartree) = -2216.64927781
Step: 35
Scan 1 out of 73
Converged(Max Force, RMS Force, Max Disp, RMS Disp): YES, YES, YES, YES
Optimized Parameters for Coordinate Value: 48.7864
--------------------------------------------------------
--------------------------------------------------------
6 0.928653 1.914728 -0.015952
6 0.523104 1.029664 1.125513
6 1.299323 0.175435 1.873723
6 2.746979 -0.062752 1.638872
6 3.250582 -1.322711 1.610272
6 2.350949 -2.505931 1.653003
6 2.487964 -3.535693 2.553022
16 1.187646 -4.668406 2.403419
6 0.447689 -3.765331 1.115501
6 1.193508 -2.661056 0.825700
1 0.897892 -1.928805 0.083033
6 -0.829701 -4.163896 0.513583
6 -1.098964 -3.899654 -0.832683
6 -2.334534 -4.256414 -1.357129
7 -3.300974 -4.854584 -0.656318
6 -3.036531 -5.108382 0.627053
6 -1.837984 -4.788211 1.253491
1 -1.705454 -4.999308 2.310312
1 -3.832212 -5.589178 1.191978
1 -2.557135 -4.057999 -2.403474
1 -0.348825 -3.437415 -1.466208
6 3.553334 -3.741710 3.588769
1 3.794918 -2.795527 4.081850
1 3.233495 -4.453260 4.354918
1 4.477112 -4.117327 3.134952
6 4.708676 -1.625521 1.559405
6 5.617384 -1.005953 2.424241
6 6.967681 -1.331651 2.386026
6 7.434526 -2.284193 1.483422
6 6.538165 -2.914834 0.624932
6 5.185521 -2.593731 0.669206
1 4.485944 -3.099951 0.009606
1 6.891156 -3.662361 -0.079406
1 8.490177 -2.536916 1.453052
1 7.657559 -0.842289 3.067116
1 5.258339 -0.264251 3.131155
6 3.588296 1.158822 1.495804
6 4.455584 1.334979 0.412017
6 5.190024 2.506923 0.274817
6 5.065591 3.527255 1.214482
6 4.191377 3.371568 2.286682
6 3.451948 2.201349 2.419198
1 2.755016 2.090210 3.245500
1 4.081099 4.164280 3.020955
1 5.637641 4.443718 1.104935
1 5.859723 2.624932 -0.572054
1 4.551226 0.545487 -0.327323
6 0.537765 -0.505652 2.874875
6 -0.791246 -0.204606 2.865525
16 -1.130701 0.970006 1.630565
6 -1.844454 -0.804660 3.690867
6 -1.727716 -2.130536 4.123087
6 -2.736173 -2.676023 4.906078
7 -3.840730 -2.019138 5.270315
6 -3.951832 -0.759281 4.843904
6 -2.997115 -0.111294 4.068471
1 -3.146985 0.924153 3.777407
1 -4.855222 -0.232457 5.143904
1 -2.656671 -3.704306 5.253685
1 -0.873433 -2.733722 3.830294
1 0.976622 -1.209660 3.572222
1 0.067886 2.192910 -0.630666
1 1.403422 2.833365 0.346506
1 1.654564 1.405702 -0.656173
Energy (Hartree) = -2216.64908578
Step: 1
Scan 2 out of 73
Converged(Max Force, RMS Force, Max Disp, RMS Disp): NO, YES, NO, NO
--------------------------------------------------------
Each section contains a set of coordinates for the atoms of a molecule, the total energy of the molecule, the step of the computational scan, and the convergence criteria. If all four convergence criteria are met, the optimized coordinate scan value is added. The data I need to extract are the coordinates, the total energy, the scan number, and the coordinate value, for converged blocks only!
I've tried tirelessly to develop a suitable expression to use for extracting the required sets of data. Currently, my RegEx code looks like the following:
\-*?\n(\d{1,2}(?:\s+[+-]?\d.*?)+)?\n\w+.*?([+-]?\d+\.\d+)\n\w.*?\n\w+\s(\d+).*?\n\w.*\n
That code is able to capture the entire section of coordinates, the energy, and the step number. Every time I try to go to the next line using \w, I immediately receive a catastrophic backtracking error. It's imperative that I have that next line in the expression, as it's what differentiates the desired block of data from the others. I'm not terribly great with Python or RegEx, and I'm requesting help. Once I have the correct expression, I'll be using a nested for loop to extract all of the data I need!
If there are any other questions I can answer to better describe my situation, please let me know! An explanation of what you do to help will be much appreciated as I want to learn as much as I can! Thank you for your help in advance!
One option is to make the pattern more specific and match the exact words instead of using \w+ and .*?
Based on the example data, if you want to capture the values for the coordinates, the total energy, and the step number, you could use 3 capturing groups:
-+\r?\n(\d{1,2}(?:[^\S\r\n]+[+-]?\d+(?:\.\d+)?)*(?:\r?\n\d{1,2}(?:[^\S\r\n]+[+-]?\d+(?:\.\d+)?)*)*)\r?\nEnergy[^\S\r\n]+\([^[()]+\)[^\S\r\n]+=[^\S\r\n]+([+-]?\d+\.\d+)\r?\nStep:[^\S\r\n]+(\d+)
Explanation
-+\r?\n Match 1+ hyphens followed by a newline
( Capture group 1
\d{1,2} Match 1-2 digits
(?: Non capture group
[^\S\r\n]+[+-]?\d+(?:\.\d+)? Match 1+ spaces followed by an optionally signed number with an optional decimal part
)* Close group and repeat 0+ times
(?: Non capture group
\r?\n\d{1,2} Match a newline and 1-2 digits
(?:[^\S\r\n]+[+-]?\d+(?:\.\d+)?)* Repeat 0+ times matching spaces followed by a number with an optional decimal part
)* Close group and repeat 0+ times
) Close group 1
\r?\nEnergy[^\S\r\n]+\([^[()]+\)[^\S\r\n]+=[^\S\r\n]+ Match a newline, then Energy, the unit in parentheses, and the equals sign
( Capture group 2
[+-]?\d+\.\d+ Match an optional - or + and 1+ digits with a decimal part
) Close group 2
\r?\nStep:[^\S\r\n]+ Match a newline, then Step: and 1+ spaces
(\d+) Capture group 3, match 1+ digits
Note that \s could also match a newline. To match whitespace chars without a newline, you could use a negated character class [^\S\r\n]
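Since the question asks for converged blocks only, here is a rough Python sketch of one way to get there. It is a different technique from the single pattern above: it splits the text on the dashed lines and then applies small, line-anchored patterns to each block, which also sidesteps the backtracking problem. The function name converged_blocks and the dict keys are made up for illustration, and the sketch assumes the file was read in text mode (so line endings are plain \n) and that only fully converged blocks carry the "Optimized Parameters" line, as in the sample.

```python
import re

# Line-anchored patterns; the labels are copied from the sample output.
ENERGY_RE = re.compile(
    r"^Energy[ \t]+\(Hartree\)[ \t]*=[ \t]*([+-]?\d+\.\d+)[ \t]*$", re.M
)
SCAN_RE = re.compile(r"^Scan[ \t]+(\d+)[ \t]+out of[ \t]+\d+[ \t]*$", re.M)
VALUE_RE = re.compile(
    r"^Optimized Parameters for Coordinate Value:[ \t]*([+-]?\d+(?:\.\d+)?)[ \t]*$",
    re.M,
)
# One coordinate line: atomic number followed by x, y, z.
COORD_RE = re.compile(
    r"^(\d{1,3})[ \t]+([+-]?\d+\.\d+)[ \t]+([+-]?\d+\.\d+)[ \t]+([+-]?\d+\.\d+)[ \t]*$",
    re.M,
)
DASHES_RE = re.compile(r"^-{5,}[ \t]*$", re.M)


def converged_blocks(text):
    """Return one dict per block that has an 'Optimized Parameters' line."""
    results = []
    for block in DASHES_RE.split(text):
        value = VALUE_RE.search(block)
        energy = ENERGY_RE.search(block)
        scan = SCAN_RE.search(block)
        if not (value and energy and scan):
            continue  # not converged, or not a data block at all
        coords = [
            (int(a), float(x), float(y), float(z))
            for a, x, y, z in COORD_RE.findall(block)
        ]
        results.append({
            "coords": coords,
            "energy": float(energy.group(1)),
            "scan": int(scan.group(1)),
            "value": float(value.group(1)),
        })
    return results


# Shortened copy of the sample: one unconverged and one converged block.
sample = """\
--------------------------------------------------------
6 0.934039 1.910373 -0.007356
1 1.665791 1.402320 -0.641683
Energy (Hartree) = -2216.64927779
Step: 34
Scan 1 out of 73
Converged(Max Force, RMS Force, Max Disp, RMS Disp): YES, YES, NO, YES
--------------------------------------------------------
--------------------------------------------------------
6 0.934062 1.911021 -0.006793
1 1.665819 1.403131 -0.641235
Energy (Hartree) = -2216.64927781
Step: 35
Scan 1 out of 73
Converged(Max Force, RMS Force, Max Disp, RMS Disp): YES, YES, YES, YES
Optimized Parameters for Coordinate Value: 48.7864
--------------------------------------------------------
"""

for block in converged_blocks(sample):
    print(block["scan"], block["energy"], block["value"], len(block["coords"]))
```

Because every pattern is anchored to a single line, no pattern can run across line boundaries, so a 100,000-line file should not trigger the runaway backtracking described in the question.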
I want to count areas of interest in my dataframe column 'which_AOI' (ranging from 0 to 9). I would like to add the results as a new column to a dataframe, depending on a variable 'marker' (ranging from 0 to x) which tells me when one 'picture' is done and the next begins (one marker can go on for a variable number of rows). This is my code so far, but it seems to be stuck and keeps running without giving any output. I tried reconstructing it from the beginning once, but as soon as I get to 'if df.marker == num' it doesn't stop. What am I missing?
(example dataframe below)
## AOI count of spec. type function (in progress):
import numpy as np
import pandas as pd
path_i = "/Users/Desktop/Pilot/results/gazedata_filename.csv"
df = pd.read_csv(path_i, sep =",")
#create a new dataframe for AOIs:
d = {'marker': []}
df_aoi = pd.DataFrame(data=d)
### Creating an Aoi list
item = df.which_AOI
aoi = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] #list for search
aoi_array = [0, 0 , 0, 0, 0, 0, 0, 0, 0, 0] #list for filling
num = 0
for i in range (0, len (df.marker)): #loop through the dataframe
    if df.marker == num: ## if marker = num its one picture
        for index, item in enumerate(aoi): #look for item (being a number in which_AOI) in aoi list
            if (item == aoi[index]):
                aoi_array[index] += 1
        print (aoi)
        print (aoi_array)
        se = pd.Series(aoi_array) # make list into a series to attach to dataframe
        df_aoi['new_col'] = se.values #add list to dataframe
        aoi_array.clear() #clears list before next picture
    else:
        num +=1
index pos_time pos_x pos_y pup_time pup_diameter marker which_AOI fixation Picname shock
1 16300 168.608779907227 -136.360855102539 16300 2.935715675354 0 7 18 5 save
2 16318 144.97673034668 -157.495513916016 16318 3.08838820457459 0 8 33 5 save
3 16351 152.92560577392598 -156.64172363281298 16351 3.0895299911499 0 7 17 5 save
4 16368 152.132453918457 -157.989685058594 16368 3.111008644104 0 7 18 5 save
5 16386 151.59835815429702 -157.55587768554702 16386 3.09514689445496 0 7 18 5 save
6 16404 150.88092803955098 -152.69479370117202 16404 3.10009074211121 1 7 37 5 save
7 16441 152.76554107666 -142.06188964843798 16441 3.0821495056152304 1 7 33 5 save
It's not 100% clear from your question, but it sounds like you want to count the number of rows for each which_AOI value within each marker.
You can accomplish this using groupby:
df_aoi = df.groupby(['marker','which_AOI']).size().unstack('which_AOI',fill_value=0)
In:
pos_time pos_x pos_y pup_time pup_diameter marker \
0 16300 168.608780 -136.360855 16300 2.935716 0
1 16318 144.976730 -157.495514 16318 3.088388 0
2 16351 152.925606 -156.641724 16351 3.089530 0
3 16368 152.132454 -157.989685 16368 3.111009 0
4 16386 151.598358 -157.555878 16386 3.095147 0
5 16404 150.880928 -152.694794 16404 3.100091 1
6 16441 152.765541 -142.061890 16441 3.082150 1
which_AOI fixation Picname shock
0 7 18 5 save
1 8 33 5 save
2 7 17 5 save
3 7 18 5 save
4 7 18 5 save
5 7 37 5 save
6 7 33 5 save
Out:
which_AOI 7 8
marker
0 4 1
1 2 0
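For reference, a self-contained sketch of the same groupby call, using only the marker and which_AOI values from the example rows. The reindex step is an addition that is not part of the answer above: it assumes you want a column for every AOI from 0 to 9, including those that never occur.

```python
import pandas as pd

# Only the two columns that matter, copied from the example rows.
df = pd.DataFrame({
    "marker":    [0, 0, 0, 0, 0, 1, 1],
    "which_AOI": [7, 8, 7, 7, 7, 7, 7],
})

# Count rows per (marker, which_AOI) pair, then pivot AOIs into columns.
counts = (
    df.groupby(["marker", "which_AOI"])
      .size()
      .unstack("which_AOI", fill_value=0)
)

# Optional: make sure every AOI 0-9 has a column, filling the gaps with 0.
counts = counts.reindex(columns=list(range(10)), fill_value=0)
print(counts)
```

Each row of counts is one marker (one picture), and each column holds how often that AOI was hit.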
I have a two dimensional list of values:
[
[[12.2],[5325]],
[[13.4],[235326]],
[[15.9],[235326]],
[[17.7],[53521]],
[[21.3],[42342]],
[[22.6],[6546]],
[[25.9],[34634]],
[[27.2],[523523]],
[[33.4],[235325]],
[[36.2],[235352]]
]
I would like to get a list of averages defined by a given step, so that for step=10 it would look like this:
[
[[10],[average of all 10-19]],
[[20],[average of all 20-29]],
[[30],[average of all 30-39]]
]
How can I achieve that? Please note that the number of 10s, 20s, 30s and so on is not always the same.
import pandas as pd
df = pd.DataFrame((q[0][0], q[1][0]) for q in thelist)
df['group'] = (df[0] / 10).astype(int)
Now df is:
0 1 group
0 12.2 5325 1
1 13.4 235326 1
2 15.9 235326 1
3 17.7 53521 1
4 21.3 42342 2
5 22.6 6546 2
6 25.9 34634 2
7 27.2 523523 2
8 33.4 235325 3
9 36.2 235352 3
Then:
df.groupby('group').mean()
Gives you the answers you seek:
0 1
group
1 14.80 132374
2 24.25 151761
3 34.80 235338
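If you want the result back in the nested-list shape from the question, with each bin labelled by its lower bound (10, 20, 30), a plain-Python sketch without pandas could look like the following. The function name step_averages is made up for illustration, and it assumes every row has the [[x],[y]] shape shown in the question.

```python
from collections import defaultdict


def step_averages(rows, step=10):
    """Average the second values, grouped by the step-wide bin of the first value."""
    buckets = defaultdict(list)
    for (x,), (y,) in rows:
        lower = int(x // step) * step  # e.g. 12.2 -> 10 for step=10
        buckets[lower].append(y)
    return [[[lower], [sum(ys) / len(ys)]] for lower, ys in sorted(buckets.items())]


thelist = [
    [[12.2], [5325]],
    [[13.4], [235326]],
    [[15.9], [235326]],
    [[17.7], [53521]],
    [[21.3], [42342]],
    [[22.6], [6546]],
    [[25.9], [34634]],
    [[27.2], [523523]],
    [[33.4], [235325]],
    [[36.2], [235352]],
]

print(step_averages(thelist, step=10))
```

Bins with no rows are simply absent from the result, so it does not matter that the number of 10s, 20s, and 30s differs.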
I'm using Python 2.7
I am trying to create new columns based on the values in a list:
tickers=['BAC','JPM','WFC','C','MS']
returns=pd.DataFrame
for tick in tickers:
    returns[tick]=bank_stocks[tick]['Close'].pct_change()
But I get this error
TypeError Traceback (most recent call last)
in ()
2 returns=pd.DataFrame
3 for tick in tickers:
----> 4 returns[tick]=bank_stocks[tick]['Close'].pct_change()
5
TypeError: 'type' object does not support item assignment
IIUC you need:
np.random.seed(100)
mux = pd.MultiIndex.from_product([['BAC','JPM','WFC','C','MS', 'Other'], ['Close', 'Open']])
df = pd.DataFrame(np.random.rand(10,12), columns=mux)
print (df)
BAC JPM WFC C \
Close Open Close Open Close Open Close
0 0.543405 0.278369 0.424518 0.844776 0.004719 0.121569 0.670749
1 0.185328 0.108377 0.219697 0.978624 0.811683 0.171941 0.816225
2 0.175410 0.372832 0.005689 0.252426 0.795663 0.015255 0.598843
3 0.980921 0.059942 0.890546 0.576901 0.742480 0.630184 0.581842
4 0.285896 0.852395 0.975006 0.884853 0.359508 0.598859 0.354796
5 0.376252 0.592805 0.629942 0.142600 0.933841 0.946380 0.602297
6 0.173608 0.966610 0.957013 0.597974 0.731301 0.340385 0.092056
7 0.395036 0.335596 0.805451 0.754349 0.313066 0.634037 0.540405
8 0.254258 0.641101 0.200124 0.657625 0.778289 0.779598 0.610328
9 0.976500 0.166694 0.023178 0.160745 0.923497 0.953550 0.210978
MS Other
Open Close Open Close Open
0 0.825853 0.136707 0.575093 0.891322 0.209202
1 0.274074 0.431704 0.940030 0.817649 0.336112
2 0.603805 0.105148 0.381943 0.036476 0.890412
3 0.020439 0.210027 0.544685 0.769115 0.250695
4 0.340190 0.178081 0.237694 0.044862 0.505431
5 0.387766 0.363188 0.204345 0.276765 0.246536
6 0.463498 0.508699 0.088460 0.528035 0.992158
7 0.296794 0.110788 0.312640 0.456979 0.658940
8 0.309000 0.697735 0.859618 0.625324 0.982408
9 0.360525 0.549375 0.271831 0.460602 0.696162
First select the 'Close' columns with slicers, then call pct_change, and finally remove the second level of the column MultiIndex with droplevel:
tickers=['BAC','JPM','WFC','C','MS']
idx = pd.IndexSlice
df = df.sort_index(axis=1)
returns = df.loc[:, idx[tickers,'Close']].pct_change()
returns.columns = returns.columns.droplevel(-1)
print (returns)
BAC C JPM MS WFC
0 NaN NaN NaN NaN NaN
1 -0.658950 0.216885 -0.482477 2.157889 171.008452
2 -0.053515 -0.266325 -0.974108 -0.756436 -0.019738
3 4.592146 -0.028390 155.551779 0.997444 -0.066841
4 -0.708544 -0.390220 0.094841 -0.152103 -0.515801
5 0.316048 0.697588 -0.353910 1.039454 1.597555
6 -0.538586 -0.847159 0.519208 0.400649 -0.216890
7 1.275448 4.870415 -0.158370 -0.782213 -0.571905
8 -0.356369 0.129391 -0.751538 5.297934 1.486019
9 2.840595 -0.654320 -0.884181 -0.212630 0.186573
Your code is correct except for the line returns=pd.DataFrame. Without the parentheses, returns is bound to the DataFrame class itself rather than to a new, empty DataFrame object, so you must call the constructor, i.e. returns=pd.DataFrame(). That's why the error says a 'type' object does not support item assignment.
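A small sketch of the difference, in case it helps. The bank_stocks frame here is a made-up stand-in with two tickers and three prices each; only its (ticker, 'Close') column layout is taken from the question.

```python
import pandas as pd

# Hypothetical stand-in for bank_stocks: columns are (ticker, field) pairs.
bank_stocks = pd.DataFrame({
    ("BAC", "Close"): [10.0, 11.0, 12.1],
    ("JPM", "Close"): [20.0, 19.0, 19.0],
})
tickers = ["BAC", "JPM"]

# Without parentheses the name refers to the class, so item assignment fails.
try:
    not_a_frame = pd.DataFrame
    not_a_frame["BAC"] = 1
except TypeError as exc:
    message = str(exc)
    print(message)

# With parentheses you get an empty DataFrame instance that accepts new columns.
returns = pd.DataFrame()
for tick in tickers:
    returns[tick] = bank_stocks[tick]["Close"].pct_change()
print(returns)
```

The first row of returns is NaN for every ticker, because pct_change has no previous value to compare against.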
I used the below code:
import pandas as pd
pandas_bigram = pd.DataFrame(bigram_data)
print pandas_bigram
I got output as below
0
0 ashoka -**0
1 - wikipedia,**1
2 wikipedia, the**2
3 the free**2
4 free encyclopedia**2
5 encyclopedia ashoka**1
6 ashoka from**2
7 from wikipedia,**1
8 wikipedia, the**2
9 the free**2
10 free encyclopedia**2
My question is: how do I split this data frame so that I get the data in two columns? The data here is separated by "**".
import pandas as pd
df= [" ashoka -**0","- wikipedia,**1","wikipedia, the**2"]
df=pd.DataFrame(df)
print(df)
0
0 ashoka -**0
1 - wikipedia,**1
2 wikipedia, the**2
Use the split function. The str.split() method splits each string on a separator (on whitespace if none is given), optionally limiting the number of splits with n. The separator here is the literal "**", so the asterisks are escaped in the pattern:
df1 = pd.DataFrame(df[0].str.split(r'\*\*', n=1).tolist(),
                   columns = ['0','1'])
print(df1)
0 1
0 ashoka - 0
1 - wikipedia, 1
2 wikipedia, the 2
So far I have been able to merge two files and get the following dataframe (df1):
ID someLength someLongerSeq someSeq someMOD someValue
A 16 XCVBNMHGFDSTHJGF NMH T3(P) 7
A 16 XCVBNMHGFDSTHJGF NmH M3(O); S4(P); S6(P) 1
B 24 HDFGKJSDHFGKJSDFHGKLSJDF HFGKJSDFH S9(P) 5
C 22 QIOWEURQOIWERERQWEFFFF RQoIWERER Q16(D); S19(P) 7
D 19 HSEKDFGSFDKELJGFZZX KELJ S7(P); C9(C); S10(P) 1
I am looking for a way to do a regex match based on the "someSeq" column: look for that substring in the "someLongerSeq" column, get the start location of the match, and then add that location to the whole numbers attached to the characters, such as T3(P).
Example:
For the second row ("ID": A), "someSeq": "NmH" matches starting at location 4 of someLongerSeq (after converting NmH to upper case). So I want to add that number 4 to the someMOD fields M3(O);S4(P);S6(P) so that I get M7(O);S8(P);S10(P), and then overwrite the someMOD column with the new value.
And do that for each row. The regex match is on a per-row basis.
Any help is really appreciated. Thanks.
First of all, I should mention that your data is hard to read in as posted. I modified it slightly (I removed the spaces from the someMOD column) so that it can be read. This is not a problem for you, since you already have your data in a data.frame. So I read the data like this:
dat <- read.table(text='ID someLength someLongerSeq someSeq someMOD someValue
A 16 XCVBNMHGFDSTHJGF NMH T3(P) 7
A 16 XCVBNMHGFDSTHJGF NmH M3(O);S4(P);S6(P) 1
B 24 HDFGKJSDHFGKJSDFHGKLSJDF HFGKJSDFH S9(P) 5
C 22 QIOWEURQOIWERERQWEFFFF RQoIWERER Q16(D);S19(P) 7
D 19 HSEKDFGSFDKELJGFZZX KELJ S7(P);C9(C);S10(P) 1',header=TRUE)
Then the idea is to:
process row by row using apply
use gregexpr to get the index of someSeq within someLongerSeq
use gsubfn to add that index to each number in someMOD
Here is the whole solution:
library(gsubfn)
res <- t(apply(dat,1,function(x){
  idx <- gregexpr(x['someSeq'],x['someLongerSeq'],
                  ignore.case = TRUE)[[1]][1]
  x[['someMOD']] <- gsubfn("[[:digit:]]+",
                           function(x) as.numeric(x)+idx,
                           x[['someMOD']])
  x
}))
as.data.frame(res)
ID someLength someLongerSeq someSeq someMOD someValue
1 A 16 XCVBNMHGFDSTHJGF NMH T8(P) 7
2 A 16 XCVBNMHGFDSTHJGF NmH M8(O);S9(P);S11(P) 1
3 B 24 HDFGKJSDHFGKJSDFHGKLSJDF HFGKJSDFH S18(P) 5
4 C 22 QIOWEURQOIWERERQWEFFFF RQoIWERER Q23(D);S26(P) 7
5 D 19 HSEKDFGSFDKELJGFZZX KELJ S18(P);C20(C);S21(P) 1
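Note that gregexpr reports 1-based positions, so the output above adds 5 for the NmH row, whereas the question counts the match as starting at location 4. If the 0-based offset from the question is what you need, the same per-row idea can be sketched in Python, where str.find is 0-based. The function name shift_mods is made up for illustration, and the sketch assumes someSeq is a plain substring with no regex metacharacters.

```python
import re


def shift_mods(longer_seq, some_seq, some_mod):
    """Add the 0-based start of some_seq in longer_seq to every number in some_mod."""
    start = longer_seq.upper().find(some_seq.upper())  # case-insensitive search
    if start == -1:
        return some_mod  # no match: leave the field unchanged
    return re.sub(r"\d+", lambda m: str(int(m.group()) + start), some_mod)


print(shift_mods("XCVBNMHGFDSTHJGF", "NmH", "M3(O); S4(P); S6(P)"))
# M7(O); S8(P); S10(P)
```

Applied to the second row, this gives M7(O); S8(P); S10(P), which is the result the question asks for.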