Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/33604139/

Time: 2020-09-14 00:11:27  Source: igfitidea

How to split pandas column by a delimiter and select preferred element as the replacement

python pandas

Asked by neversaint

I have the following pandas data frame:

import pandas as pd
source_df = pd.DataFrame({'gene': ["1 // foo // blabla",
                                   "2 // bar // lalala",
                                   "3 // qux // trilil",
                                   "4 // woz // hohoho"],
                          'cell1': [5, 9, 1, 7], 'cell2': [12, 90, 13, 87]})
# Select the columns in the desired order; copy so later assignments do not warn
df = source_df[["gene", "cell1", "cell2"]].copy()

It looks like this:

                 gene  cell1  cell2
0  1 // foo // blabla      5     12
1  2 // bar // lalala      9     90
2  3 // qux // trilil      1     13
3  4 // woz // hohoho      7     87

What I want to get is this:

   gene    cell1  cell2
0   foo       5     12
1   bar       9     90
2   qux       1     13
3   woz       7     87

Namely, select the 2nd element of each string after splitting it with // as the delimiter.

The best I can do is this:

df["gene"] = df["gene"].str.split(" // ")
df
Out[17]:
               gene  cell1  cell2
0  [1, foo, blabla]      5     12
1  [2, bar, lalala]      9     90
2  [3, qux, trilil]      1     13
3  [4, woz, hohoho]      7     87

What's the right way to do it?

Answered by EdChum

Use the vectorised str.split; this will be much faster than using apply on a large dataset:

In [13]:
df['gene'] = df['gene'].str.split('//').str[1]
df

Out[13]:
   cell1  cell2   gene
0      5     12   foo 
1      9     90   bar 
2      1     13   qux 
3      7     87   woz 
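Because the split above is on '//' alone, the extracted values still carry the surrounding spaces. A minimal sketch of two ways to drop them, assuming the delimiter is always padded with single spaces as in the sample data (the variable names by_padded and by_strip are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["1 // foo // blabla", "2 // bar // lalala"],
    "cell1": [5, 9],
    "cell2": [12, 90],
})

# Option 1: include the padding spaces in the delimiter itself.
by_padded = df["gene"].str.split(" // ").str[1]

# Option 2: split on the bare delimiter, then strip leftover whitespace.
by_strip = df["gene"].str.split("//").str[1].str.strip()

print(by_padded.tolist())
print(by_strip.tolist())
```

Option 2 is probably the more forgiving choice if the amount of padding around // varies between rows.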

Answered by jezrael

You can use a regex and then remove the leading and trailing spaces with strip:

# expand=False returns a Series rather than a one-column DataFrame
df["gene"] = df["gene"].str.extract(r"\/\/([a-z ]+)\/\/", expand=False)
df["gene"] = df["gene"].str.strip()

print(df)
  gene  cell1  cell2
0  foo      5     12
1  bar      9     90
2  qux      1     13
3  woz      7     87

\/\/([a-z ]+)\/\/ means:

    \/ matches the character / literally
    \/ matches the character / literally
    1st Capturing group ([a-z ]+)
        [a-z ]+ match a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible,
            giving back as needed [greedy]
            a-z a single character in the range between a and z (case sensitive)
            " " the literal space character
    \/ matches the character / literally
    \/ matches the character / literally

Or regex without strip:

df["gene"] = df["gene"].str.extract(r"\/\/\s*([a-z ]+)\s\/\/", expand=False)

\/\/\s*([a-z ]+)\s\/\/ means:

    \/ matches the character / literally
    \/ matches the character / literally
    \s* match any white space character [\r\n\t\f ]
        Quantifier: * Between zero and unlimited times, as many times as possible, 
        giving back as needed [greedy]
    1st Capturing group ([a-z ]+)
        [a-z ]+ match a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible, 
            giving back as needed [greedy]
            a-z a single character in the range between a and z (case sensitive)
            " " the literal space character
    \s match any white space character [\r\n\t\f ]
    \/ matches the character / literally
    \/ matches the character / literally
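To see how the two patterns differ, they can be tried on a single string with Python's standard re module. This is only a sketch for checking the regexes outside pandas. The sample strings are taken from or modelled on the question's data, and the last one (with an uppercase letter and a digit) is a made-up case to show a limitation of the [a-z ] character class.

```python
import re

sample = "1 // foo // blabla"

# First pattern: the captured group keeps the padding spaces,
# which is why the answer calls strip afterwards.
with_padding = re.search(r"\/\/([a-z ]+)\/\/", sample).group(1)

# Second pattern: \s* and \s consume the padding around the name.
without_padding = re.search(r"\/\/\s*([a-z ]+)\s\/\/", sample).group(1)

# Hypothetical input: [a-z ] accepts only lowercase letters and spaces,
# so a name containing uppercase letters or digits is not matched.
no_match = re.search(r"\/\/([a-z ]+)\/\/", "1 // Foo1 // blabla")

print(repr(with_padding))
print(repr(without_padding))
print(no_match)
```

If real gene names contain digits or capitals, the character class would need widening, whereas the split-based answers do not depend on the name's characters.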

Answered by sohail288

You were pretty close; however, selecting an element from the resulting lists is a bit more awkward when done your way.

Here's a solution with apply:

>>> df['gene'] = df['gene'].apply(lambda s: s.split('//')[1])
>>> df

    gene  cell1  cell2
0   foo       5     12
1   bar       9     90
2   qux       1     13
3   woz       7     87
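One practical difference between apply and the .str accessor appears when the column has missing values. The sketch below assumes such a gap exists (the question's data has none), and the names genes, via_str and apply_failed are illustrative only.

```python
import pandas as pd

# Hypothetical column with one missing entry.
genes = pd.Series(["1 // foo // blabla", None, "3 // qux // trilil"])

# The .str accessor skips missing values and leaves them missing.
via_str = genes.str.split(" // ").str[1]

# A plain lambda calls .split on every element, so the missing value
# raises AttributeError instead of being passed through.
try:
    genes.apply(lambda s: s.split(" // ")[1])
    apply_failed = False
except AttributeError:
    apply_failed = True

print(via_str.tolist())
print(apply_failed)
```

With apply, the lambda would need an explicit check for missing values before splitting.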