I have a text which I want to convert into a dictionary.
Here's the format of the text :
Apple 0
orange 5:
text1 : random text
text2 : random text
text3 : random text
text4 : random text
orange 6:
text1 : random text
text2 : random text
text3 : random text
text4 : random text
Apple 1
orange 12:
text1 : random text
text2 : random text
text3 : random text
text4 : random text
orange 13:
text1 : random text
text2 : random text
text3 : random text
text4 : random text
I want to convert the text into a dictionary like this :
dic_text = {'apple-0-orange-5-text1' : 'random text','apple-0-orange-5-text2' : 'random text','apple-0-orange-5-text3' : 'random text','apple-0-orange-5-text4' : 'random text','apple-0-orange-6-text1' : 'random text','apple-0-orange-6-text2' : 'random text','apple-0-orange-6-text3' : 'random text','apple-0-orange-6-text4' : 'random text','apple-1-orange-12-text1' : 'random text','apple-1-orange-12-text2' : 'random text','apple-1-orange-12-text3' : 'random text','apple-1-orange-12-text4' : 'random text','apple-1-orange-13-text1' : 'random text','apple-1-orange-13-text2' : 'random text','apple-1-orange-13-text3' : 'random text','apple-1-orange-13-text4' : 'random text'}
Can anyone tell me a generic way of making a dictionary something like above?
I am assuming the following, which you did not state (please edit the question to clarify whether it holds):
That all the elements are on separate lines
That all the elements take at most one line (so random text does not span multiple lines)
That you want the keys in lowercase
That you do not want to preserve the whitespace at beginning/end of the keys and random text
random text cannot be just whitespace
The "Apple X" line does not contain a :
The "orange Y" line is the only kind of line that ends in : (possibly followed by whitespace), so random text cannot end in :.
After an "Apple X" line there is always an "orange Y" line (possibly after some empty lines).
Then you can do something like this:
def build_dict(iterable):
    result = {}
    main_key = None
    sub_key = None
    for line in iterable:
        # remove whitespace at beginning/end of line
        line = line.strip()
        if not line:
            # throw away empty lines
            continue
        elif ':' not in line:
            # we found an "Apple X" line, transform that into apple-X
            main_key = '-'.join(line.lower().split())
            sub_key = None
        elif line[-1] == ':':
            # we found an "orange Y" line; drop the trailing ':' so we get orange-Y
            sub_key = '-'.join(line[:-1].lower().split())
        else:
            # add a `textX : random_text` element; split on the first ':' only
            key, value = line.split(':', 1)
            result['-'.join([main_key, sub_key, key.strip()])] = value.strip()
    return result
So you keep track of the current Apple X value in main_key and the current orange Y value in sub_key. Every textX : random_text line is then split on the first :, the three parts of the key are joined with -, and the value is saved in the dictionary.
If the assumptions I made do not hold, you have to handle things like multiline values, and how to do that depends on the exact format of the file.
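As a rough sketch of how this could be called and guarded, assuming Python 3 and an in-memory sample: build_dict_checked is a hypothetical variant of the function above that raises an error when a value line appears before any "Apple X" / "orange Y" header, which is one way the assumptions can fail.

```python
def build_dict_checked(lines):
    """Hypothetical variant of build_dict that fails loudly on a missing header."""
    result = {}
    main_key = sub_key = None
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if ':' not in line:
            # "Apple X" line: start a new group and reset the sub key
            main_key = '-'.join(line.lower().split())
            sub_key = None
        elif line.endswith(':'):
            # "orange Y:" line: drop the trailing colon before joining
            sub_key = '-'.join(line[:-1].lower().split())
        else:
            if main_key is None or sub_key is None:
                raise ValueError('line %d has no Apple/orange header above it' % number)
            # split on the first ':' only, so values may contain colons
            key, value = line.split(':', 1)
            result['-'.join([main_key, sub_key, key.strip()])] = value.strip()
    return result


sample = """\
Apple 0
orange 5:
text1 : random text
text2 : random text

Apple 1
orange 12:
text1 : random text
"""

dic_text = build_dict_checked(sample.splitlines())
for key in sorted(dic_text):
    print(key, '=>', dic_text[key])
```

Passing an open file object instead of sample.splitlines() should work the same way, since the function only iterates over lines.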
Data :
col 1
AL GHAITHA
AL ASEEL
EMARAT AL
LOREAL
ISLAND CORAL
My code :
import re

def remove_words(df, col, letters):
    regular_expression = '^' + '|'.join(letters)
    df[col] = df[col].apply(lambda x: re.sub(regular_expression, "", x))
Desired output :
col 1
GHAITHA
ASEEL
EMARAT
LOREAL
ISLAND CORAL
SUNRISE
Function call :
letters = ['AL','SUPERMARKET']
remove_words(df=df, col='col 1', letters=letters)
Basically, I want to remove the given words when they appear at the start or at the end of the string. (Note: a word should only be removed when it is a separate word.)
For example, "EMARAT AL" should become "EMARAT".
Note that "LOREAL" should not become "LORE".
Code to build the df :
import pandas as pd

raw_data = {'col 1': ['AL GHAITHA', 'AL ASEEL', 'EMARAT AL', 'LOREAL UAE',
                      'ISLAND CORAL', 'SUNRISE SUPERMARKET']
            }
df = pd.DataFrame(raw_data)
You may use
pattern = r'(?s)^(?:{0})\b|(.*)\b(?:{0})$'.format("|".join(map(re.escape, letters)))
df['col 1'] = df['col 1'].str.replace(pattern, r'\1', regex=True).str.strip()
The r'(?s)^(?:{0})\b|(.*)\b(?:{0})$'.format("|".join(map(re.escape, letters))) call creates a pattern like (?s)^(?:word1|word2)\b|(.*)\b(?:word1|word2)$, which matches any listed word as a whole word at the start or at the end of the string. The (?:...) groups make the ^, \b and $ anchors apply to every alternative rather than only to the first and last one.
When the word is matched at the end of the string, all the text before it is captured into Group 1, so the replacement pattern contains the \1 placeholder to restore that text in the resulting string. When the word is matched at the start, Group 1 does not participate and, in Python 3.5 and later, \1 expands to an empty string.
If your letters list contains only items made of word characters, you may omit re.escape and replace map(re.escape, letters) with letters.
The .str.strip() call removes any resulting leading/trailing whitespace.
See the regex demo.
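The whole-word matching can also be checked without pandas. The following is only a sketch, assuming Python 3 and a plain list of strings; strip_words is a hypothetical helper, and it uses an empty replacement instead of the \1 approach described above, which should give the same result for this data.

```python
import re

def strip_words(values, words):
    # Build one alternation of the escaped words, e.g. AL|SUPERMARKET
    alternation = "|".join(map(re.escape, words))
    # Match a listed word only as a whole word at the very start or very end
    pattern = re.compile(r"^(?:{0})\b|\b(?:{0})$".format(alternation))
    # Remove the match, then trim the whitespace it leaves behind
    return [pattern.sub("", value).strip() for value in values]


names = ["AL GHAITHA", "AL ASEEL", "EMARAT AL", "LOREAL UAE",
         "ISLAND CORAL", "SUNRISE SUPERMARKET"]
cleaned = strip_words(names, ["AL", "SUPERMARKET"])
print(cleaned)
```

"LOREAL UAE" and "ISLAND CORAL" are left alone because there is no word boundary before the trailing "AL" inside "LOREAL" or "CORAL".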
import string
sentence = raw_input("Enter sentence:")
for i in string.punctuation:
    sentence = sentence.replace(i," ")
word_list = sentence.split()
word_list.sort(key=str.lower)
print word_list
for j in word_list:
    print j,":",word_list.count(j)
    word_list.remove(j)
When I use this code and type in a sample sentence, some of my words are not counted correctly:
Sample sentence: I, are.politics:wodng!"frail A P, Python. Python Python frail
output:
['A', 'are', 'frail', 'frail', 'I', 'P', 'politics', 'Python', 'Python', 'Python', 'wodng']
A : 1
frail : 2
I : 1
politics : 1
Python : 3
wodng : 1
What happened to the words "are" and "P"? I know the problem is happening in the last few lines but I don't know what's causing it.
Thanks!
The problem in your code is that you remove elements from the list while you are iterating over it.
I therefore suggest separating the iteration from the list by converting word_list into a set. You can then iterate over the set word_iter, which contains every word exactly once, and you no longer need to remove anything. The disadvantage is that the output order is arbitrary, because sets are unordered. But you can collect the results in a list and sort that afterwards:
import string
sentence = raw_input("Enter sentence:")
for i in string.punctuation:
    sentence = sentence.replace(i," ")
word_list = sentence.split()
word_list.sort(key=str.lower)
print word_list
result = []
word_iter = set(word_list)
for j in word_iter:
    print j, ':', word_list.count(j)
    result.append( (j, word_list.count(j)) )
result:
A : 1
wodng : 1
Python : 3
I : 1
P : 1
are : 1
frail : 2
politics : 1
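At the end of your script, your list is not empty.
Each time you remove a value, the remaining elements shift left by one while the loop's internal index still advances, so the loop skips the element that follows every removed one.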
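The skipping can be reproduced in isolation. This is a small sketch, assuming Python 3 and the sorted word list from the question; it records which words the loop actually visits, and then counts with collections.Counter as one possible alternative that does not modify the list.

```python
from collections import Counter

words = ['A', 'are', 'frail', 'frail', 'I', 'P', 'politics',
         'Python', 'Python', 'Python', 'wodng']

# Reproduce the bug on a copy: removing while iterating skips elements
buggy = list(words)
visited = []
for w in buggy:
    visited.append(w)
    buggy.remove(w)

print(visited)  # the words the loop actually saw
print(buggy)    # the words that were never visited or removed

# Counting without mutating the list
counts = Counter(words)
for w in sorted(counts, key=str.lower):
    print(w, ':', counts[w])
```

The words "are" and "P" never show up in visited, which matches the missing lines in the question's output.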
How can I print out this dictionary so that the words are sorted by the number of times (frequency) they appear in the text?
slova = dict()
for line in text:
    line = re.split('[^a-z]',text)
    line[i] = filter(None,line)
    i =+ 1
i = 0
for line in text:
    for word in line:
        if word not in slova:
            slova[word] = i
            i += 1
I'm not sure what your text looks like, and you haven't provided example output, so here is my guess. If this doesn't help, please update your question and I'll try again. The code uses Counter from collections to do all the heavy lifting. First, the words from all lines of the text are flattened into a single list, and that list is passed to Counter. The keys of the Counter (the words) are then sorted by their counts and printed out.
CODE:
from collections import Counter
import re
text = ['hello hi hello yes hello',
        'hello hi hello yes hello']
all_words = [w for l in text for w in re.split('[^a-z]',l)]
word_counts = Counter(all_words)
sorted_words = sorted(word_counts.keys(),
                      key=lambda k: word_counts[k],
                      reverse = True)
#Print out the word and counts
for word in sorted_words:
    print word,word_counts[word]
OUTPUT:
hello 6
yes 2
hi 2
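A slightly shorter variant, sketched here under the assumption of Python 3 and the same sample text: Counter.most_common() already returns (word, count) pairs ordered from most to least frequent, so the explicit sorted call is not needed. Using re.findall instead of re.split is my own substitution; it keeps only runs of lowercase letters, so no empty strings end up in the counts.

```python
from collections import Counter
import re

text = ['hello hi hello yes hello',
        'hello hi hello yes hello']

# Collect every run of lowercase letters from every line
all_words = [w for line in text for w in re.findall(r'[a-z]+', line)]
word_counts = Counter(all_words)

# most_common() yields (word, count) pairs, highest count first
ranked = word_counts.most_common()
for word, count in ranked:
    print(word, count)
```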
I have a lot of text cards in the same format,
and I need to parse them to extract some values.
I don't understand how to use REGEXP_SUBSTR in Oracle. Please help me write SQL statements that return the values I have marked between ** symbols (for example: first string, 02/02/11, AA11223344, etc.):
From: "abc def (**first string**)" <email#site.com>
02/01/2011 09:27 First Date : **02/02/11**
Player : BILL BROWN ID : **AA11223344**
At : YELLOW STREET. CD Number : **A11223FER**
Code :
BUYS : **123M** (M) of AAA 0 02/02/11 Owner : **England**
Shiped : **02/04/11**
Number : **11.223344** Digi : **1.2370000**
Ddate: **02/04/11**
Notes : **First line here**
* Size : **USD 11,222,333.44**
* Own ( **0 days** ): **0.00**
* Total : USD **222,333,444.55**
You can apply the regular expression repeatedly by using a hierarchical query: at each level, you look for the LEVEL-th occurrence in your string.
Please pay attention to the "non greedy" operator (??) in the pattern string, explained here, as well as to the regular expression functions.
with test as (
select 'From: "abc def (**first string**)" <email#site.com>
02/01/2011 09:27 First Date : **02/02/11**
Player : BILL BROWN ID : **AA11223344**
At : YELLOW STREET. CD Number : **A11223FER**
Code :
BUYS : **123M** (M) of AAA 0 02/02/11 Owner : **England**
Shiped : **02/04/11**
Number : **11.223344** Digi : **1.2370000**
Ddate: **02/04/11**
Notes : **First line here**
* Size : **USD 11,222,333.44**
* Own ( **0 days** ): **0.00**
* Total : USD **222,333,444.55** ' as txt
from dual
)
select TRIM('*' FROM regexp_substr(txt, '\*\*(.*??)\*\*', 1, LEVEL, 'n'))
from test
CONNECT BY regexp_substr(txt, '\*\*(.*??)\*\*', 1, LEVEL, 'n') is not null
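To see why the non-greedy quantifier matters, here is an illustration in Python rather than Oracle SQL. It is only a sketch of the matching behaviour on a shortened sample; Python's re module and Oracle's regular expression functions are different implementations, so treat it as an analogy and not as a drop-in equivalent.

```python
import re

txt = ('Player : BILL BROWN ID : **AA11223344**\n'
       'Number : **11.223344** Digi : **1.2370000**')

# Non-greedy: each match stops at the nearest closing **
marked = re.findall(r'\*\*(.*?)\*\*', txt, flags=re.DOTALL)

# Greedy: a single match runs from the first ** to the very last **
greedy = re.findall(r'\*\*(.*)\*\*', txt, flags=re.DOTALL)

print(marked)
print(len(greedy))
```

The non-greedy version returns one entry per marked value, while the greedy version collapses everything between the first and last markers into a single match.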
I have a text file I want to parse. The file has multiple items I want to extract. I want to capture everything between a colon ":" and a particular word. Take the following example.
Description : a pair of shorts
amount : 13 dollars
requirements : must be blue
ID1 : 199658
----
The following code parses the information out.
import re
f = open ("parse.txt", "rb")
fileRead = f.read()
Description = re.findall("Description :(.*?)amount", fileRead, re.DOTALL)
amount = re.findall("amount :(.*?)requirements", fileRead, re.DOTALL)
requirements = re.findall("requirements :(.*?)ID1", fileRead, re.DOTALL)
ID1 = re.findall("ID1 :(.*?)-", fileRead, re.DOTALL)
print Description[0]
print amount[0]
print requirements[0]
print ID1[0]
f.close()
The problem is that sometimes the text file will have a line break before the colon, such as this:
Description
: a pair of shorts
amount
: 13 dollars
requirements: must be blue
ID1: 199658
----
In this case my code will not work because it is unable to find "Description :" because it is now separated into a new line. If I choose to change the search to ":(.*?)requirements" it will not return just the 13 dollars, it will return a pair of shorts and 13 dollars because all of that text is in between the first colon and the word, requirements. I want to have a way of parsing out the information no matter if there is a line break or not. I have hit a road block and your help would be greatly appreciated.
You can use a regex like this:
Description[^:]*:\s*(.*)
^--- use the keyword you want
Working demo
Adapting your code, you could use:
import re
f = open("parse.txt", "rb")
fileRead = f.read()
# [^:]* skips anything (including a line break) between the keyword and the colon;
# the colon and the whitespace after it are matched outside the capture group
Description = re.findall(r"Description[^:]*:\s*(.*)", fileRead)
amount = re.findall(r"amount[^:]*:\s*(.*)", fileRead)
requirements = re.findall(r"requirements[^:]*:\s*(.*)", fileRead)
ID1 = re.findall(r"ID1[^:]*:\s*(.*)", fileRead)
print Description[0]
print amount[0]
print requirements[0]
print ID1[0]
f.close()
You can simply do this:
import re
f = open ("new.txt", "rb")
fileRead = f.read()
# raw string, so that \b means a word boundary and not a backspace character
pairs = re.findall(r'([^:]*):(.*)(?=\b[^:]*:|$)', fileRead, re.M)
keyvals = {k.strip(): v.strip() for k, v in pairs}
print(keyvals)
f.close()
Output:
{'amount': '13 dollars', 'requirements': 'must be blue', 'Description': 'a pair of shorts', 'ID1': '199658'}
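Both layouts from the question can be checked with one small function. This is a minimal sketch, assuming Python 3, in-memory strings instead of a file, and that every value fits on one line; parse_fields is a hypothetical name and the pattern is my own simplification, not the one from the answers above.

```python
import re

def parse_fields(text):
    # key: a run without ':' or newline; \s* lets the colon sit on the next line;
    # value: the rest of the line after the colon
    pairs = re.findall(r'([^:\n]+)\s*:[ \t]*(.*)', text)
    return {key.strip(): value.strip() for key, value in pairs}


same_line = """Description : a pair of shorts
amount : 13 dollars
requirements : must be blue
ID1 : 199658
----
"""

broken_line = """Description
: a pair of shorts
amount
: 13 dollars
requirements: must be blue
ID1: 199658
----
"""

fields_a = parse_fields(same_line)
fields_b = parse_fields(broken_line)
print(fields_a)
print(fields_a == fields_b)
```

The "----" separator is ignored because no colon follows it.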