extracting numbers in list to dict of lists

I have a dict of lists that looks similar to this:
'H82746': ['Hsa.1070', 'H82746', 'U1', 'SMALL', 'NUCLEAR', 'RIBONUCLEOPROTEIN', 'C', ';.', '1.75', '1.46', '1.75', '1.69', '1.30', '1.11', '1.42', '1.11', '1.92', '0.99', '0.65', '1.69', '1.39', '1.29', '1.55', '2.00', '1.16', '0.70', '1.48', '0.78', '1.52', '1.28', '1.50', '0.79', '1.31', '1.56', '1.33', '1.66', '1.67', '1.34', '1.48', '0.38', '0.76', '1.27', '1.66', '1.12', '1.40', '1.23', '1.66', '1.58', '2.33', '1.25', '0.90', '0.63', '0.58', '0.97', '0.79', '0.90', '1.25', '1.52', '1.78', '1.56', '1.66', '1.39', '1.42', '1.07', '1.63', '2.00', '2.06', '1.37', '1.38', '1.33']
I want to extract all of the numbers (just the numeric entries like '1.66' and '1.42', not the letter-number combinations like 'H82746' or 'Hsa.1070'), so I can add them to the key in the dict of lists.
I have tried iterating over number and i, but I have not succeeded in extracting only the numbers. Do you have any suggestions?

It's not clear what exactly you want to achieve, but from what I understood you can do the following:
$ echo "<your_string>" | grep -o "'[0-9\.]*'" > extracted_numbers.txt
which will write the numbers to the extracted_numbers.txt file in the following format:
'1.75'
'1.46'
'1.75'
...
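Since the data is already a Python dict of lists, the same filtering can also be sketched directly in Python. This is only a minimal sketch, not the asker's code: the helper name is_number and the shortened sample list are assumptions made for illustration. Note that float() also accepts strings such as 'nan', 'inf' or '1e5', which may or may not be wanted.

```python
def is_number(text):
    """Return True if text parses as a float (hypothetical helper)."""
    try:
        float(text)
    except ValueError:
        return False
    return True


# Shortened sample of the dict from the question (assumed for illustration).
data = {
    'H82746': ['Hsa.1070', 'H82746', 'U1', 'SMALL', 'C', ';.',
               '1.75', '1.46', '0.99'],
}

# Keep only the entries that parse as numbers, converted to float.
numbers = {key: [float(v) for v in values if is_number(v)]
           for key, values in data.items()}

print(numbers)
```

The result maps each key to a list of floats, so the non-numeric labels are dropped without touching the original dict.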

Related

How to call characters from first list with second list

I want to input two comma separated strings, the first a set of strings and the second a set of ranges, and return substrings based on the ranges. For example:
x=input("Input string to search: ")
search=x.split(',')
y=input("Input numbers to locate: ")
numbers=y.split(',')
I would then like to use the second list of ranges to print out specified characters from the first list.
An example:
Input string to search: abcdefffg,aabcdefghi,bbcccdefghi
Input numbers to locate: 1:2,2:3,5:9
I would like the output to look like this:
bc
bcd
defghi
Any suggestions? Thanks in advance!
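split(':') splits a "range" into its two components. map(int, ...) converts them to integers. string[a:b] takes the characters from index a up to, but not including, index b. Going by your expected output, the second number of each range is a count of characters rather than an end index, so the slice ends at start+how_many.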
zip is an easy way to read from two different lists combined.
Let me know if you have any other questions:
x = "abcdefffg,aabcdefghi,bbcccdefghi"
search = x.split(',')
y = "1:2,2:3,5:9"
numbers = y.split(',')
results = []
for string, rng in zip(search, numbers):
    start, how_many = map(int, rng.split(':'))
    results.append(string[start:start+how_many])
print(" ".join(results))
# Output:
# bc bcd defghi
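If each substring should appear on its own line, as in the question's expected output, the same logic can be wrapped in a function. This is a rough sketch under the same "start:count" reading of the ranges; the function name slice_by_ranges is made up for illustration.

```python
def slice_by_ranges(strings, ranges):
    """Pair each string with a 'start:count' range and return the substrings."""
    substrings = []
    for text, rng in zip(strings, ranges):
        start, how_many = (int(part) for part in rng.split(':'))
        substrings.append(text[start:start + how_many])
    return substrings


results = slice_by_ranges(
    "abcdefffg,aabcdefghi,bbcccdefghi".split(','),
    "1:2,2:3,5:9".split(','),
)

# One substring per line instead of a single space separated line.
print("\n".join(results))
```

Slicing past the end of a string is not an error in Python, which is why the range 5:9 works on a string that has only six characters left.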

convert item in list to int

From the code below, I get: TypeError: can't multiply sequence by non-int of type 'list'
x = list(input("Input a list of integers, seperated by a comma:"))
def histogram(x):
    for y in x:
        print("*"*x)
histogram(x)
I am assuming you want something that will look a little like:
>>> histogram('1,5,1,3')
*
*****
*
***
You have to change a few things then.
First, x = list(input()) will give you a list of the individual characters in your input. So list('1,2,3') will yield ['1', ',', '2', ',', '3'], and I'm sure you don't want to deal with the commas.
Instead, you should use string.split(',') to split on every occurrence of a comma. Depending on how you want to do this, you can do it inside your function (like I did below) or directly on the input, so that the function expects a list of strings of numbers.
I say "a list of strings of numbers" because input returns a string, so even if you enter numbers, it will give you STRINGS. You have to explicitly convert each one with int to be able to multiply '*' by it.
And last, the actual answer to your question? You are calling print("*"*x). I think you meant print("*"*y) or more precisely, print("*"*int(y)).
Below is what I assume to be working code, but again, I don't know what your expected output actually is.
>>> x = input("Please enter comma separated list of numbers: ")
Please enter comma separated list of numbers: 1,2,3,4,5,6,1,2
>>> def histogram(x):
...     for y in x.split(','):
...         print("*"*int(y))
...
>>> histogram(x)
*
**
***
****
*****
******
*
**
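The other option mentioned above, converting the input before it reaches the function, might look roughly like this. It is a sketch rather than the asker's code: parse_counts is a hypothetical helper, and histogram here returns the rows instead of printing them so that it is easier to check.

```python
def parse_counts(text):
    """Split a comma separated string and convert each piece to int."""
    return [int(piece) for piece in text.split(',')]


def histogram(counts):
    """Return one row of asterisks per integer in counts."""
    return ["*" * count for count in counts]


rows = histogram(parse_counts("1,5,1,3"))
print("\n".join(rows))
```

int() tolerates surrounding whitespace, so input such as "1, 5, 1, 3" would parse the same way.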

Why is max number ignoring two-digit numbers?

At the moment I am saving a set of variables to a text file. I am doing the following to check if my code works, but whenever I use a two-digit number such as 10, it is not printed as the max number.
If my text file looked like this:
tom:5
tom:10
tom:1
It would output 5 as the max number.
name = input('name')
score = 4
if name == 'tom':
    fo= open('tom.txt','a')
    fo.write('Tom: ')
    fo.write(str(score ))
    fo.write("\n")
    fo.close()
if name == 'wood':
    fo= open('wood.txt','a')
    fo.write('Wood: ')
    fo.write(str(score ))
    fo.write("\n")
    fo.close()
tomL2 = []
woodL2 = []
fo = open('tom.txt','r')
tomL = fo.readlines()
tomLi = tomL2 + tomL
fo.close
tomLL=max(tomLi)
print(tomLL)
fo = open('wood.txt','r')
woodL = fo.readlines()
woodLi = woodL2 + woodL
fo.close
woodLL=max(woodLi)
print(woodLL)
You are comparing strings, not numbers. You need to convert them into numbers before using max. For example, you have:
tomL = fo.readlines()
This contains a list of strings:
['tom:5\n', 'tom:10\n', 'tom:1\n']
Strings are ordered lexicographically (much like how words would be ordered in an English dictionary). If you want to compare numbers, you need to turn them into numbers first:
tomL_scores = [int(s.split(':')[1]) for s in tomL]
The parsing is done in the following way:
….split(':') separates the string into parts using a colon as the delimiter:
'tom:5\n' becomes ['tom', '5\n']
…[1] chooses the second element from the list:
['tom', '5\n'] becomes '5\n'
int(…) converts a string into an integer:
'5\n' becomes 5
The list comprehension [… for s in tomL] applies this sequence of operations to every element of the list.
Note that int (and similarly float) is rather picky about what it accepts: the string must be in the form of a valid numeric literal or it will be rejected with an error (although leading and trailing whitespace is allowed). This is why you need ….split(':')[1] to massage the string into a form that it's willing to accept.
This will yield:
[5, 10, 1]
Now, you can apply max to obtain the largest score.
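To see the difference side by side, here is a small sketch that uses an in-memory list in place of the file contents (the variable names lines and scores are just for illustration):

```python
# Stand-in for what fo.readlines() returns for the example file.
lines = ['tom:5\n', 'tom:10\n', 'tom:1\n']

# Comparing the raw strings: '5' sorts after '1', so 'tom:5\n' wins.
largest_string = max(lines)

# Comparing the parsed integers gives the intended result.
scores = [int(line.split(':')[1]) for line in lines]
largest_score = max(scores)

print(repr(largest_string))
print(largest_score)
```

The string comparison is decided at the first differing character, which is why 'tom:5' beats 'tom:10' regardless of how many digits follow.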
As a side-note, the statement
fo.close
will not close the file, since it doesn't actually call the function. To call a function you must follow its name with parentheses, even if there are no arguments:
fo.close()
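One way to avoid forgetting the call altogether is a with block, which closes the file automatically when the block ends. The following is a sketch only: max_score is a hypothetical helper, and a temporary file stands in for tom.txt so that the example runs on its own.

```python
import os
import tempfile


def max_score(path):
    """Read 'name:score' lines from path and return the highest score."""
    with open(path) as handle:  # closed automatically when the block ends
        return max(int(line.split(':')[1])
                   for line in handle if line.strip())


# Write a throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('tom:5\ntom:10\ntom:1\n')

best = max_score(tmp.name)
os.remove(tmp.name)
print(best)
```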

Hadoop Pig: Extract all substrings matching a given regular expression

I am parsing some data of the form:
(['L123', 'L234', 'L1', 'L253764'])
(['L23', 'L2'])
(['L5'])
...
where the phrases inside the parens, including the brackets, are encoded as a single chararray.
I want to extract just the L+(digits) tags to obtain tuples of the form:
((L123, L234, L1, L253764))
((L23, L2))
((L5))
I have tried using REGEX_EXTRACT_ALL using the regular expression '(L\d+)', but it only seems to extract a single tag per line, which is useless to me. Is there a way to create tuples in the way I have described above?
If order does not matter, then this will work:
-- foo is the tuple, and bar is the name of the chararray
B = FOREACH A GENERATE TOKENIZE(foo.bar, ',') AS values: {T: (value: chararray)} ;
C = FOREACH B {
    clean_values = FOREACH values GENERATE
        REGEX_EXTRACT(value, '(L[0-9]+)', 1) AS clean_value: chararray ;
    GENERATE clean_values ;
}
The schema and output are:
C: {clean_values: {T: (clean_value: chararray)}}
({(L123),(L234),(L1),(L253764)})
({(L23),(L2)})
({(L5)})
Generally, if you don't know how many elements the array will have, then a bag will be a better fit than a tuple.

Convert list of one number to int

I have a regular expression that parses a line# string from a log. That line# is then subjected to another regular expression to just extract the line#.
For example:
Part of this regex:
m = re.match(r"^(\d{4}-\d{2}-\d{2}\s*\d{2}:\d{2}:\d{2}),?(\d{3}),?(?:\s+\[(?:[^\]]+)\])+(?<=])(\s+?[A-Z]+\s+?)+(\s?[a-zA-Z0-9\.])+\s?(\((?:\s?\w)+\))\s?(\s?.)+", line)
Will match this:
(line 206)
Then this regex:
re.findall(r'\b\d+\b', linestr)
Gives me
['206']
In order to further process my information I need to have the line number as an integer, and I am at a loss as to how to do that.
You may try:
line_int = int(re.findall(r'\b\d+\b', linestr)[0])
or if you have more than one element in the list:
lines_int = [int(i) for i in re.findall(r'\b\d+\b', linestr)]
or even (on Python 3, wrap the call in list(...), since map returns an iterator there):
lines_int = map(int, re.findall(r'(\b\d+\b)+', linestr))
I hope it helps -^.^-
Use int() to convert your list of one "string number" to an int:
myl = ['206']
int(myl[0])
206
If you have a list of these, you can convert them all to ints using a list comprehension:
[int(i) for i in myl]
resulting in a list of ints.
You can hook this into your code as best fits, e.g.,
int(re.findall(r'\b\d+\b', linestr)[0])
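Indexing with [0] raises an IndexError when the string contains no digits at all, so a slightly more defensive variant could look like this. It is a sketch: the function name extract_line_number and the sample strings are assumptions, and it uses re.search instead of re.findall because only the first match is needed.

```python
import re


def extract_line_number(linestr):
    """Return the first standalone run of digits as an int, or None if absent."""
    match = re.search(r'\b\d+\b', linestr)
    return int(match.group()) if match else None


print(extract_line_number('(line 206)'))
print(extract_line_number('(no digits here)'))
```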