Pyspark-length of an element and how to use it later - python-2.7

So I have a dataset of words, I try to keep only those that are longer than 6 characters:
data=dataset.map(lambda word: word,len(word)).filter(len(word)>=6)
When:
print data.take(10)
it returns all of the words, including the first 3, which have length lower than 6. I don't actually want to print them, but to continue working on the data that have length of at least 6.
So when I will have the appropriate dataset, I would like to be able to select the data that I need, for example the ones that have length less than 15 and be able to make computations on them.
Or even to apply a function on the "word".
Any ideas??

What you want is something along these lines (untested):
data=dataset.map(lambda word: (word,len(word))).filter(lambda t : t[1] >=6)
In the map, you return a tuple of (word, length of word). The filter must also be given a function: it receives each tuple t and looks at the length t[1], keeping only the (word, length) pairs whose length is greater than or equal to 6.
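As a rough illustration of the same map-then-filter logic, here is a minimal sketch that uses Python's built-in map and filter on a plain list as a stand-in for the RDD, so it runs without Spark. The word list is made up for the example; RDD.map and RDD.filter accept the same lambdas.

```python
# Hypothetical sample data standing in for the RDD of words.
words = ["cat", "to", "a", "elephant", "python", "spark", "distributed"]

# Pair each word with its length, as in the map step of the answer.
pairs = map(lambda word: (word, len(word)), words)

# Keep only the pairs whose length (t[1]) is at least 6.
long_pairs = list(filter(lambda t: t[1] >= 6, pairs))

# Later steps can reuse the stored length, e.g. keep lengths below 15
# and apply a function to the word itself (t[0]).
upper_short = [(word.upper(), n) for (word, n) in long_pairs if n < 15]

print(long_pairs)
print(upper_short)
```

Carrying the length along in the tuple is what lets later steps select on it again without recomputing it.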

Related

Extract multiple substrings of numbers of a specific length from string in Google Sheets

I'd need to split or extract only numbers made of 8 digits from a string in Google Sheets.
I've tried with SPLIT or REGEXREPLACE but I can't find a way to get only the numbers of that length, I only get all the numbers in the string!
For example I'm using
=SPLIT(lower(N2),"qwertyuiopasdfghjklzxcvbnm`-=[]\;' ,./!:##$%^&*()")
but I get all the numbers while I only need 8 digits numbers.
This may be a test value:
00150412632BBHBBLD 12458 32354 1312548896 ACT inv 62345471
I only need to extract "62345471" and nothing else!
Could you please help me out?
Many thanks!
Please use the following formula for a single cell.
Drag it down for more cells.
=INDEX(TRANSPOSE(QUERY(TRANSPOSE(IF(LEN(SPLIT(REGEXREPLACE(A2&" ","\D+"," ")," "))=8,
SPLIT(REGEXREPLACE(A2&" ","\D+"," ")," "),"")),"where Col1 is not null ",0)))
Functions used:
QUERY
INDEX
TRANSPOSE
IF
LEN
SPLIT
REGEXREPLACE
If you only need to do this for one cell (or you have your heart set on dragging the formula down into individual cells), use the following formula:
=REGEXEXTRACT(" "&N2&" ","\s(\d{8})\s")
However, I suspect you want to process the eight-digit number out of all cells running N2:N. If that is the case, clear whatever will be your results column (including any headers) and place the following in the top cell of that otherwise cleared results column:
=ArrayFormula({"Your Header"; IF(N2:N="",,IFERROR(REGEXEXTRACT(" "&N2:N&" ","\s(\d{8})\s")))})
Replace the header text Your Header with whatever you want your actual header text to be. The formula will show that header text and will return all results for all rows where N2:N is not null. Where no eight-digit number is found, null will be returned.
By prepending and appending a space to the N2:N raw strings before processing, spaces before and after string components can be used to determine where only eight digits exist together (as opposed to eight digits within a longer string of digits).
The only assumption here is that there are, in fact, spaces between string components. I did not assume that the eight-digit number will always be in a certain position (e.g., first, last) within the string.
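The space-padding idea is not specific to Sheets. Below is a minimal Python sketch of the same technique, using the test value from the question; the function name is hypothetical and the pattern mirrors the one in the formula above.

```python
import re

def extract_eight_digits(raw):
    """Return the first standalone eight-digit number in raw, or None."""
    # Pad with spaces so a number at the very start or end is still
    # surrounded by whitespace, exactly as the formula does with " "&N2&" ".
    padded = " " + raw + " "
    match = re.search(r"\s(\d{8})\s", padded)
    return match.group(1) if match else None

sample = "00150412632BBHBBLD 12458 32354 1312548896 ACT inv 62345471"
print(extract_eight_digits(sample))
```

Like the formula, this relies on spaces separating the string components, so eight digits embedded in a longer run of digits are not picked up.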
Try this, take a look at Example sheet
=FILTER(TRANSPOSE(SPLIT(B2," ")),LEN(TRANSPOSE(SPLIT(B2," ")))=8)
Or this to get them all.
=JOIN(" ,",FILTER(TRANSPOSE(SPLIT(B2," ")),LEN(TRANSPOSE(SPLIT(B2," ")))=8))
Explanation
SPLIT the string with the delimiter set to " " (space), TRANSPOSE the result into a column, and FILTER TRANSPOSE(SPLIT(B2," ")) with condition1 set to LEN(TRANSPOSE(SPLIT(B2," ")))=8.
JOIN the output column with " ," to get all occurrences of numbers with a length of 8.
Note: to get the numbers with a length of N, just replace 8 in the FILTER function with a cell reference.
Using this on a cell worked just fine for me:
(cell_with_data)=REGEXEXTRACT(A1,"[0-9]{8}$")

Power BI - extract number from text string based on conditions

I have done extensive searching and I don't believe this is a repeat, but it is definitely an extension of previous questions. I am attempting to extract numbers from a text string within a Power BI function. I have successfully extracted the numbers from the string into a value using the below:
Text.Combine(
List.RemoveNulls(
List.Transform(
Text.ToList([string_col]),
each if Value.Is(Value.FromText(_), type number)
then _ else null)
)
)
Using this code works great when the number I am interested in is the only number in the string, for example:
"Bring on the 1234567 comments" results in 1234567
However, I can't resolve extracting my number when multiple different numbers occur in the string, for example:
"Bring on on the 1234567 comments with 50 telling me this is a repeat" results in 123456750
What I need to do is pull only the number within the string that meets certain conditions (one condition in my case). For my particular issue, the number I need to extract will always be the only 7-digit number in the string, so I feel like this should have a more straightforward answer?
Is there a way to extract only the 7-digit number using my provided function or something similar? If I am way off base, can someone please set me on the proper path?
As always, the community's help is greatly appreciated.
Diedrich
First, you could use the Text.Select function to extract all numbers.
FirstStep =
Table.AddColumn(Source, "MyNumberColumn", each Text.Select([MyStringColumn], {"0".."9"}))
I found this solution on this blog post from Erik Svensen:
https://eriksvensen.wordpress.com/2018/03/06/extraction-of-number-or-text-from-a-column-with-both-text-and-number-powerquery-powerbi
For your specific requirement, you may need to set the type of the new column to text:
FirstStep =
Table.AddColumn(
Source,
"MyTempNumberColumn",
each Text.Select([MyStringColumn], {"0".."9"}),
type text)
From there, depending on the length of the result, you can test whether each seven-character sequence of the digits-only text is present in the original string, shifting the start position by one as many times as needed until you reach the end of the digits-only text.
SecondStep =
Table.AddColumn(
FirstStep,
"My7numbers",
// Text.Range raises an error when fewer than 7 characters remain
each if Text.Length([MyTempNumberColumn]) = 7
then [MyTempNumberColumn]
else if
Text.Contains([MyStringColumn],
Text.Range([MyTempNumberColumn], 0, 7))
then
Text.Range([MyTempNumberColumn], 0, 7)
else if
Text.Contains([MyStringColumn],
Text.Range([MyTempNumberColumn], 1, 7))
then
Text.Range([MyTempNumberColumn], 1, 7)
else null)
Depending on how many numbers you can get, it might be worth using List.Generate in a function that returns a list of every 7-digit sequence from [MyTempNumberColumn], whatever its length.
https://learn.microsoft.com/en-us/powerquery-m/list-generate
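To make the sliding-window idea concrete, here is a minimal sketch in Python rather than M. The function name and the width parameter are hypothetical; the logic follows the steps above: keep only the digits, take every 7-character window, and retain the windows that appear contiguously in the original string.

```python
def seven_digit_candidates(text, width=7):
    """Return every width-long digit window that occurs as-is in text."""
    # Equivalent of Text.Select(..., {"0".."9"}): keep only the digits.
    digits = "".join(ch for ch in text if ch in "0123456789")
    # Every contiguous window of the digits-only text.
    windows = [digits[i:i + width] for i in range(len(digits) - width + 1)]
    # A window that spans two separate numbers will not be found in text.
    return [w for w in windows if w in text]

sample = "Bring on on the 1234567 comments with 50 telling me this is a repeat"
print(seven_digit_candidates(sample))
```

One caveat with this sketch: a window taken from inside a longer number (8 digits or more) also occurs in the original string, so the approach relies on the stated condition that the target is the only 7-digit number present.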

Retrieving count with leading zeros

I am trying to fetch the count of rows that I got from SQL Server, using a column. I am achieving it through
<xsl:value-of select='format-number(count(//Report/ABC_Data/Details_Collection/Details/Sequence_Number),"#")'/>
But the issue is, I need it as 6 characters: if the count is just 2 digits, say 62, I need it as 000062. Any help on this, please?
Also, is there a way to add two nodes (and pad the result with leading zeros to a length of 20)?
I am trying
<xsl:value-of select='format-number(sum(//Report/ABC_Data/Details_Collection/Details/Initial_Amount|Final_Amount),"$#.00")'/>
Your number pattern needs to look like 000000 rather than using # characters:
format-number(62, '000000')
gives 000062.
You can use a node-set in sum: try something like:
sum((//Report/ABC_Data/Details_Collection/Details/Initial_Amount, //Report/ABC_Data/Details_Collection/Details/Final_Amount))
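The outer set of parentheses is for the sum function; the inner set defines a (comma-separated) sequence of nodes. Note that each XPath term could theoretically return a sequence of more than one value! The comma-separated sequence syntax requires XPath 2.0; in XPath 1.0, combine the two full paths with the union operator | instead.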
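For comparison outside XSLT, here is a minimal Python sketch of the same two operations: zero-padding a node count to 6 characters and padding a sum of two sets of nodes to 20 characters. The sample document is made up; only the element names come from the question.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical sample data; only the element names come from the question.
SAMPLE = """
<Report>
  <ABC_Data>
    <Details_Collection>
      <Details>
        <Sequence_Number>1</Sequence_Number>
        <Initial_Amount>10.50</Initial_Amount>
        <Final_Amount>4.25</Final_Amount>
      </Details>
      <Details>
        <Sequence_Number>2</Sequence_Number>
        <Initial_Amount>100.00</Initial_Amount>
        <Final_Amount>0.25</Final_Amount>
      </Details>
    </Details_Collection>
  </ABC_Data>
</Report>
"""

root = ET.fromstring(SAMPLE)
details_path = "./ABC_Data/Details_Collection/Details"

# Count the Sequence_Number nodes and pad to 6 digits, like '000000'.
count = len(root.findall(details_path + "/Sequence_Number"))
padded_count = format(count, "06d")

# Sum both amount columns using their full paths, then left-pad to 20.
amounts = (root.findall(details_path + "/Initial_Amount")
           + root.findall(details_path + "/Final_Amount"))
total = sum(Decimal(node.text) for node in amounts)
padded_total = str(total).zfill(20)

print(padded_count)
print(padded_total)
```

Both amount paths are written out in full here, which mirrors the point of the answer: each term in the sum needs its own complete path.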

Is it possible for a regex to identify all equal value subarrays?

An equal value subarray is a subarray containing one or more consecutive elements of the same value.
For example, let's say our array is:
1,1,3
There are four equal value sub-arrays:
[1], [1], [3], [1,1]
Note that elements can be part of more than one subarray.
I know [\d] matches digits, but this requirement is defeating me. I am asking for a regex solution out of curiosity.
There's no way to do this with one regex. In fact, I recommend that you use more than one version of the string.
This regex should work:
^(\d+)(,\1){n}
I've made some adjustments to ensure a more robust regex:
Allows for numbers greater than 10
Will only match at the start, ensuring the count is not thrown off
For an array of length 4, you should replace n with 0, 1, 2, 3. This means that you will have to match against four regexes.
(Note that n=0 is the same as ^(\d+))
Furthermore, you will have to "behead" the string, meaning that you would first match against 1,1,1,3 (new example) and then 1,1,3, and then 1,3, and then 3.
Fun fact: you can use a regex to behead the string (group 1 will have the beheaded string):
^\d+,(.*)
(Obviously, you will need to ensure that you're not trying to behead an array of size 1.)
For an array of size 4, you will need to match against 4+3+2+1=10 regexes. You should test to see if the regex matched; if it did, you know to increment your count by 1. (Note that 10 is the maximum number of consecutive combinations for an array of 4.)
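The procedure above can be sketched in Python as follows. The function name is hypothetical, and the trailing lookahead (?=,|$) is an addition that is not part of the regex above: it guards against a value matching only the prefix of a longer number (for example 1 against 12).

```python
import re

def count_equal_value_subarrays(csv):
    """Count equal value subarrays in a comma-separated string of integers."""
    count = 0
    remaining = csv
    while remaining:
        length = remaining.count(",") + 1
        # Try n = 0 .. length-1 repetitions, anchored at the start.
        for n in range(length):
            pattern = r"^(\d+)(,\1){%d}(?=,|$)" % n
            if re.match(pattern, remaining):
                count += 1
        # "Behead" the string: drop the first element and its comma.
        beheaded = re.match(r"^\d+,(.*)", remaining)
        remaining = beheaded.group(1) if beheaded else ""
    return count

print(count_equal_value_subarrays("1,1,3"))
print(count_equal_value_subarrays("1,1,1,1"))
```

For 1,1,3 this counts the four subarrays listed in the question, and for 1,1,1,1 it reaches the maximum of 10.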
Here's an explanation of why you need to use more than one string. Take this regex:
(\d)(,?\1){n}
Again, n needs to be replaced. You would also need to use the g modifier (or its equivalent).
I'll use the example 1,1,1,1:
n=0 gives 4 matches
n=1 gives 2 matches
n=2 gives 1 match
n=3 gives 1 match
As you can see, it does not handle overlapping matches very well, because that's not how regex was designed.

Every other letter

So, I have tried this problem what seems like a hundred times this week alone.
It's filling in the blank for the following program...
You entered jackson and ville.
When these are combined, it makes jacksonville.
Taking every other letter gives us jcsnil.
The blanks I have filled are fine, but the rest of the blanks, I can't figure out. Here they are.
x = raw_input("Enter a word: ")
y = raw_input("Enter another word: ")
print("You entered %s and %s." % (x,y))
combined = x + y
print("When these are combined, it makes %s." % combined)
every_other = ""
counter = 0
for __________________ :
    if ___________________ :
        every_other = every_other + letter
    ____________
print("Taking every other letter gives us %s." % every_other)
I just need three blanks for this program. This is basic Python, so nothing too complicated, or something I can match with the twenty options. Please, I appreciate your help!
The first blank needs to define letter so that each time through the loop it is the letter at position counter in combined
The second blank needs to test for the current letter's position being one that gets included
The last blank needs to modify counter for the next value of letter (much as the initial value of counter was for the first letter).
The solution is to slice with a step value.
In [10]: "jacksonville"[::2]
Out[10]: 'jcsnil'
The slice notation means "take the subset starting at the beginning of the iterable, ending at the end of the iterable, selecting every second element". Remember that Python slices start by selecting the first element available in the slice.
EDIT: Didn't realize it had to fill in the blanks
for letter in combined:
    if (counter % 2) == 0:
        every_other = every_other + letter
    counter += 1
Since taking every other would mean you take every second letter, or every second pass through the loop, and you use counter to track how many passes you've made, you can use modulo division (%) to check when to take a letter. The base case is that 0 % 2 = 0, which lets you take the first letter. It's important to remember to always increment the counter.
A way to do this without the manual counter, which was already mentioned in the comments, is to use the enumerate function on combined. When given an iterable as a parameter, enumerate returns an iterator that yields a pair with each request: the position in the iterable and the value of the iterable at that position.
I'm using this language, talking about iterables as index-able sequences, but it could be any generator-like object which doesn't have to have a finite, pre-defined sequence.
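A minimal sketch of the enumerate variant, wrapped in a hypothetical helper function so it can be tried without the raw_input prompts:

```python
def every_other_letter(combined):
    """Return the letters at even positions (0, 2, 4, ...) of combined."""
    every_other = ""
    # enumerate supplies the position, replacing the manual counter.
    for index, letter in enumerate(combined):
        if index % 2 == 0:
            every_other = every_other + letter
    return every_other

print(every_other_letter("jackson" + "ville"))
```

This gives the same result as the slice "jacksonville"[::2], but it also works on iterables that cannot be sliced.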