pandas: How to split a pandas column by a delimiter and select the preferred element as the replacement

Question

Asked by neversaint

I have the following pandas data frame:

import pandas as pd
source_df = pd.DataFrame({ 'gene':["1 // foo // blabla",
                                   "2 // bar // lalala",
                                   "3 // qux // trilil",
                                   "4 // woz // hohoho"], 'cell1':[5,9,1,7], 'cell2':[12,90,13,87]})
# Select the columns in a fixed order
df = source_df[["gene","cell1","cell2"]]

It looks like this:

                 gene  cell1  cell2
0  1 // foo // blabla      5     12
1  2 // bar // lalala      9     90
2  3 // qux // trilil      1     13
3  4 // woz // hohoho      7     87

What I want to get is this:

   gene    cell1  cell2
0   foo       5     12
1   bar       9     90
2   qux       1     13
3   woz       7     87

Namely, select the 2nd element of the string after splitting it with // as the delimiter.

The best I can do is this:

df["gene"] = df["gene"].str.split(" // ")
df
Out[17]:
               gene  cell1  cell2
0  [1, foo, blabla]      5     12
1  [2, bar, lalala]      9     90
2  [3, qux, trilil]      1     13
3  [4, woz, hohoho]      7     87

What's the right way to do it?

Answer 1

Answered by EdChum

Use the vectorised str.split; this will be much faster than using apply on a large dataset:

In [13]:
df['gene'] = df['gene'].str.split('//').str[1]
df

Out[13]:
   cell1  cell2   gene
0      5     12   foo 
1      9     90   bar 
2      1     13   qux 
3      7     87   woz
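
Splitting on '//' alone keeps the spaces around each piece, which is why the values in the output above carry trailing blanks. The following is a minimal sketch of two ways to avoid that, assuming the delimiter is always surrounded by single spaces as in the question's data. The names genes_a and genes_b are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["1 // foo // blabla", "2 // bar // lalala",
             "3 // qux // trilil", "4 // woz // hohoho"],
    "cell1": [5, 9, 1, 7],
    "cell2": [12, 90, 13, 87],
})

# Option A: split on the delimiter together with its surrounding spaces
genes_a = df["gene"].str.split(" // ").str[1]

# Option B: split on the bare delimiter, then strip leftover whitespace
genes_b = df["gene"].str.split("//").str[1].str.strip()

# Both options give the same result; replace the column with either one
df["gene"] = genes_a
print(df)
```

Option B is the more forgiving choice if the amount of whitespace around the delimiter varies between rows.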

Answer 2

Answered by jezrael

You can use a regex with str.extract, then remove the leading and trailing spaces with str.strip:

df["gene"] = df["gene"].str.extract(r"\/\/([a-z ]+)\/\/")
df["gene"] = df["gene"].str.strip()

print(df)
  gene  cell1  cell2
0  foo      5     12
1  bar      9     90
2  qux      1     13
3  woz      7     87

The pattern \/\/([a-z ]+)\/\/ means:

    \/ matches the character / literally
    \/ matches the character / literally
    1st Capturing group ([a-z ]+)
        [a-z ]+ matches a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible,
            giving back as needed [greedy]
            a-z a single character in the range between a and z (case sensitive)
            the literal space character
    \/ matches the character / literally
    \/ matches the character / literally

Or use a regex that does not need strip:

df["gene"] = df["gene"].str.extract(r"\/\/\s*([a-z ]+)\s\/\/")

The pattern \/\/\s*([a-z ]+)\s\/\/ means:

    \/ matches the character / literally
    \/ matches the character / literally
    \s* matches any white space character [\r\n\t\f ]
        Quantifier: * Between zero and unlimited times, as many times as possible,
        giving back as needed [greedy]
    1st Capturing group ([a-z ]+)
        [a-z ]+ matches a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible,
            giving back as needed [greedy]
            a-z a single character in the range between a and z (case sensitive)
            the literal space character
    \s matches any white space character [\r\n\t\f ]
    \/ matches the character / literally
    \/ matches the character / literally
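
As a self-contained sketch of the second pattern, the example below runs it against the question's data. Two details are my own choices rather than part of the answer: the forward slashes are left unescaped, since / has no special meaning in Python regular expressions, and expand=False is passed so that str.extract returns a Series for the single capture group.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["1 // foo // blabla", "2 // bar // lalala",
             "3 // qux // trilil", "4 // woz // hohoho"],
    "cell1": [5, 9, 1, 7],
    "cell2": [12, 90, 13, 87],
})

# Capture the lowercase text between the first pair of // delimiters.
# \s* before the group and \s after it keep the surrounding spaces
# out of the captured value, so no strip() call is needed.
df["gene"] = df["gene"].str.extract(r"//\s*([a-z ]+)\s//", expand=False)
print(df)
```

Because the character class is [a-z ], rows whose middle element contains uppercase letters or digits would not match and would become NaN.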

Answer 3

Answered by sohail288

You were pretty close; however, selecting an element from the lists produced by the split is a bit more difficult when done your way.

Here's a solution with apply:

>>> df['gene'] = df['gene'].apply(lambda s: s.split('//')[1])
>>> df

    gene  cell1  cell2
0   foo       5     12
1   bar       9     90
2   qux       1     13
3   woz       7     87
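
As with the vectorised version, splitting on '//' leaves spaces around each piece. A minimal sketch of the apply approach that also strips them, assuming every row contains at least one // delimiter and no missing values:

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["1 // foo // blabla", "2 // bar // lalala",
             "3 // qux // trilil", "4 // woz // hohoho"],
    "cell1": [5, 9, 1, 7],
    "cell2": [12, 90, 13, 87],
})

# Take the second piece of each string and remove surrounding whitespace.
# This raises IndexError for a row without the delimiter, and
# AttributeError for a non-string value such as NaN.
df["gene"] = df["gene"].apply(lambda s: s.split("//")[1].strip())
print(df)
```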
