Remove spaces and \n from inside the list python - list

This is the list I am getting:
['', '', ' NRGD\n ', '\n MicroSectors U.S. Big Oil Index -3X Inverse Leveraged ETN\n ', ' $102.24\n ', ' 5012.00%\n \n2070.00', '\n ']
I want to "clean it up" and return:
['NRGD', 'MicroSectors U.S. Big Oil Index -3X Inverse Leveraged ETN', '$102.24', '5012.00%', '2070.00']
Basically, I want to remove all the items that are just spaces or \n. For the items with actual text, I want to strip the spaces and \n and keep just the text.

We can use a list comprehension here:
inp = ['', '', ' NRGD\n ', '\n MicroSectors U.S. Big Oil Index -3X Inverse Leveraged ETN\n ', ' $102.24\n ', ' 5012.00%\n \n2070.00', '\n ']
output = [x.strip() for x in inp if x.strip()]
print(output)
This prints:
['NRGD', 'MicroSectors U.S. Big Oil Index -3X Inverse Leveraged ETN',
'$102.24', '5012.00%\n \n2070.00']
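The above logic retains any list element which, after stripping leading and trailing whitespace, is not an empty string, and keeps each such element with its whitespace trimmed. Note that strip() only touches the ends of a string, so the \n embedded in '5012.00%\n \n2070.00' is left in place.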
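To get the exact list asked for in the question, where '5012.00%' and '2070.00' end up as separate items, one option is to split each element on line breaks before stripping. This is a minimal sketch, assuming every embedded \n should start a new item; it is not part of the original answer.

```python
inp = ['', '', ' NRGD\n ', '\n MicroSectors U.S. Big Oil Index -3X Inverse Leveraged ETN\n ', ' $102.24\n ', ' 5012.00%\n \n2070.00', '\n ']

# Break every element into lines, strip each piece,
# and keep only the pieces that still contain text.
output = [part.strip() for x in inp for part in x.splitlines() if part.strip()]
print(output)
```

Using splitlines() rather than split() keeps the spaces inside 'MicroSectors U.S. Big Oil Index -3X Inverse Leveraged ETN' intact, since only line breaks act as separators.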

Related

Splitting/Tokenizing a sentence into string words with special conditions

I am trying to implement a tokenizer to split string of words.
The special conditions I have are: split the punctuation marks . , ! ? into separate strings,
and split on spaces, e.g. I have a dog!'-4# -> 'I', 'have', 'a', 'dog', '!', "'-4#"
Something like this.....
I don't plan on using the nltk package, and I have looked at re.split and re.findall, yet I am stuck in both cases:
re.split = I don't know how to split off punctuation that sits next to a word, such as 'Dog,'
re.findall = Sure, it returns all the matched strings, but what about the unmatched ones?
If you have any suggestions, I'd be very happy to try them.
Are you trying to split on a delimiter (punctuation) while keeping it in the final results? One way of doing that would be this:
import re
import string
sent = "I have a dog!'-4#"
punc_Str = str(string.punctuation)
print(re.split(r"([.,;:!^ ])", sent))
This is the result I get.
['I', ' ', 'have', ' ', 'a', ' ', 'dog', '!', "'-4#"]
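The result above still contains the space delimiters, because the capture group keeps every matched delimiter. A small sketch of one way to drop them, assuming whitespace-only and empty tokens are not wanted (the filtering step is an addition, not part of the answer above):

```python
import re

sent = "I have a dog!'-4#"

# The capture group keeps the delimiters in the result;
# the filter then discards tokens that are empty or only whitespace.
tokens = [tok for tok in re.split(r"([.,;:!^ ])", sent) if tok.strip()]
print(tokens)
```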
Try:
re.findall(r'[a-z]+|[.!?]|(?:(?![.!?])\S)+', txt, re.I)
Alternatives in the regex:
[a-z]+ - a non-empty sequence of letters (ignore case),
[.!?] - any (single) char from your list (note that between brackets neither a dot nor a '?' needs to be escaped),
(?:(?![.!?])\S)+ - a non-empty sequence of non-whitespace characters, other than those in your list.
E.g. for text containing I have a dog!'-4#?. the result is:
['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
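For completeness, here is the same call as a self-contained snippet. The variable name txt comes from the answer; the sample text is the one quoted above.

```python
import re

txt = "I have a dog!'-4#?."

# Letters, single punctuation marks from the list, or runs of any other
# non-whitespace characters; re.I makes [a-z] match upper case too.
tokens = re.findall(r"[a-z]+|[.!?]|(?:(?![.!?])\S)+", txt, re.I)
print(tokens)
```

Because whitespace matches none of the three alternatives, findall simply skips it, which is why no filtering step is needed here.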

python3.x regex split chops off first character

I'm not sure what's wrong with my regex or why it's chopping off the first character. The regex correctly identifies what I want to split on, but why is the first character missing from each element of the array?
>>> f = "value: http://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:user-services-http/ssoeproxy/logout value: http://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:user-services-http-two/ssoeproxy/logout value: user-services-http #458930 value: user-services-http-two #458930"
>>> re.split(r'[a-z0-9]([-a-z0-9]*[a-z0-9])?', f)
['', 'alue', ': ', 'ttp', '://', 'c2-xxx-xxx-xxx-xxx', '.', 'ompute-1', '.', 'mazonaws', '.', 'om', ':', 'ser-services-http', '/', 'soeproxy', '/', 'ogout', ' ', 'alue', ': ', 'ttp', '://', 'c2-xxx-xxx-xxx-xxx', '.', 'ompute-1', '.', 'mazonaws', '.', 'om', ':', 'ser-services-http-two', '/', 'soeproxy', '/', 'ogout', ' ', 'alue', ': ', 'ser-services-http', ' #', '58930', ' ', 'alue', ': ', 'ser-services-http-two', ' #', '58930', '']
A more detailed explanation of your problem: re.split() splits on the whole match of your regular expression, but the only part of each match it puts back into the result list is the text of the capture groups. In this case you're capturing everything but the first letter, because [a-z0-9] is outside your parentheses, so that first character is consumed as part of the delimiter and thrown away. Move your parentheses to include this part and you're good to go.
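A short sketch of the difference, using a shortened sample string (an assumption, cut down from the one in the question) so the output stays readable. The inner group is made non-capturing so that only the outer group is returned.

```python
import re

# Shortened sample input, assumed for illustration.
f = "value: user-services-http #458930"

# Original pattern: the first character sits outside the capture group,
# so it is matched (and dropped) but never returned.
broken = re.split(r'[a-z0-9]([-a-z0-9]*[a-z0-9])?', f)

# Parentheses moved to wrap the whole token; the inner group is non-capturing.
fixed = re.split(r'([a-z0-9](?:[-a-z0-9]*[a-z0-9])?)', f)

# If only the tokens are wanted, findall without any capture group also works.
tokens = re.findall(r'[a-z0-9](?:[-a-z0-9]*[a-z0-9])?', f)

print(broken)
print(fixed)
print(tokens)
```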

Replace Leading and Trailing Spaces Oracle

I have a column in an Oracle database table whose values have leading and trailing spaces. I would like the leading spaces to be replaced with 'P' and the trailing spaces with 'T', using an inline query only.
If you want to replace each leading/trailing space with an equal number of P/Ts then you can use:
SELECT REPLACE( REGEXP_SUBSTR( your_column, '^ +' ), ' ', 'P' )
|| TRIM( BOTH FROM your_column )
|| REPLACE( REGEXP_SUBSTR( your_column, ' +$' ), ' ', 'T' )
FROM your_table
If you want to replace the spaces with a single P/T then:
SELECT REGEXP_REPLACE(
REGEXP_REPLACE( your_column, '(.*?) +$', '\1T' ),
'^ +(.*)',
'P\1'
)
FROM your_table
You didn't specify whether the number of leading and trailing spaces is constant. If it is, something like this could be used:
select replace(substr(' hello world ',1, instr(' hello world ',' ',1,2) ),' ','P')||
trim(' hello world ')||
replace(substr(' hello world ', instr(' hello world ',' ',-1,2), length(' hello world ') ),' ','T')
from dual;
Note that the number "2" in each of the instr calls in the query represents the constant number of leading/trailing spaces, so you should change it to suit your needs.

Regex: Match Empty \s inside [' ']

Hello, I'm trying to match whitespace characters (\s, which could be any of [\r\n\t\f ]) only inside the text between [' and '].
For example:
$lang['some random text'] = 'some random text';
$lang['other random text'] = 'other random text';
I want to replace each whitespace character with _. The example above should end up in the following format:
$lang['some_random_text'] = 'some random text';
$lang['other_random_text'] = 'other random text';
Language: Plain regex
Could someone explain what the right approach would be?
Thanks!
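One possible approach is a lookahead that only accepts whitespace when the next bracket-related thing ahead of it is the closing ']. The pattern itself is plain regex; Python is used here only to run it. This is a sketch under the assumption that the keys never contain [ or ] themselves, and it has only been checked against the two sample lines from the question.

```python
import re

src = (
    "$lang['some random text'] = 'some random text';\n"
    "$lang['other random text'] = 'other random text';"
)

# \s              one whitespace character
# (?=[^\[\]]*'\]) followed, with no [ or ] in between, by the closing ']
pattern = r"\s(?=[^\[\]]*'\])"

result = re.sub(pattern, "_", src)
print(result)
```

Whitespace in the values on the right-hand side is left alone because, looking ahead from there, the regex hits the next [ (or the end of the text) before it can find a '].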

PostgreSQL regexp_replace() to keep just one whitespace

I need to clean up a string column that has both spaces and tabs inside the strings, at the beginning, or at the end (it's a mess!). I want to keep just one space between each word. Say we have the following string that includes every possible situation:
mystring = ' one two three four '
2 whitespaces before 'one'
1 whitespace between 'one' and 'two'
4 whitespaces between 'two' and 'three'
2 tabs after 'three'
1 tab after 'four'
Here is the way I do it:
I delete leading and trailing spaces
I delete leading and trailing tabs
I replace both runs of two or more spaces and tabs with a single space
WITH
t1 AS (SELECT' one two three four '::TEXT AS mystring),
t2 AS (SELECT TRIM(both ' ' from mystring) AS mystring FROM t1),
t3 AS (SELECT TRIM(both '\t' from mystring) AS mystring FROM t2)
SELECT regexp_replace(mystring, '(( ){2,}|\t+)', ' ', 'g') FROM t3 ;
I eventually get the following string, which looks nice but I still have a trailing whitespace...
'one two three four '
Any ideas on doing this more simply and solving this last issue?
Many thanks!
SELECT trim(regexp_replace(col_name, '\s+', ' ', 'g')) as col_name FROM table_name;
Or, in the case of an update:
UPDATE table_name SET col_name = trim(regexp_replace(col_name, '\s+', ' ', 'g'));
The regexp_replace flags are described in this section of the documentation.
SELECT trim(regexp_replace(mystring, '\s+', ' ', 'g')) as mystring FROM t1;
Posting an answer in case folks don't look at comments.
Use '\s+'
Not '\\s+'
Worked for me.
It didn't work for me with trim and regexp_replace, so I came up with another solution:
SELECT trim(
array_to_string(
regexp_split_to_array(' test with many spaces for this test ', E'\\s+')
, ' ')
) as mystring;
First regexp_split_to_array eliminates all spaces leaving "blanks" at the beginning and the end.
-- regexp_split_to_array output:
-- {"",test,with,many,spaces,for,this,test,""}
When using array_to_string, all the ',' become spaces
-- array_to_string output ( '_' instead of spaces for viewing ):
-- _test_with_many_spaces_for_this_test_
The trim is to remove the head and tail
-- trim output ( '_' instead of spaces for viewing ):
-- test_with_many_spaces_for_this_test
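For readers who want to inspect the intermediate values outside the database, the same three steps can be mirrored in Python. This is only an illustration of the logic, not PostgreSQL code, and the sample string is an assumption chosen to include repeated spaces and a tab.

```python
import re

# Assumed sample input with leading/trailing spaces, repeated spaces and a tab.
s = "  test  with\tmany spaces "

parts = re.split(r"\s+", s)   # counterpart of regexp_split_to_array
joined = " ".join(parts)      # counterpart of array_to_string(..., ' ')
cleaned = joined.strip()      # counterpart of trim

print(parts)
print(repr(joined))
print(repr(cleaned))
```

The empty strings at both ends of parts are what turn into the single leading and trailing space after the join, which is why the final trim is still needed.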