Python: Pandas filter string data based on its string length
Original question: http://stackoverflow.com/questions/19937362/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Asked by notilas
I would like to filter out data whose string length is not equal to 10.
To filter out any row where the string length in column A or column B is not equal to 10, I tried this:
df=pd.read_csv('filex.csv')
df.A=df.A.apply(lambda x: x if len(x)== 10 else np.nan)
df.B=df.B.apply(lambda x: x if len(x)== 10 else np.nan)
df=df.dropna(subset=['A','B'], how='any')
This is slow, but it works.
However, it sometimes produces an error when the data in A is not a string but a number (interpreted as a number when read_csv reads the input file):
File "<stdin>", line 1, in <lambda>
TypeError: object of type 'float' has no len()
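The error can be reproduced with a small sketch. The in-memory CSV text below is made up for illustration (it is not the asker's filex.csv): column A holds only digits plus one missing value, so read_csv parses it as a numeric column and len() then fails on the float values.

```python
import io

import pandas as pd

# Hypothetical stand-in for the input file: A is all digits with one missing value.
csv_text = "A,B\n1234567890,abcdefghij\n,abc\n"
df = pd.read_csv(io.StringIO(csv_text))

# A is parsed as a floating-point column, not as strings.
print(df["A"].dtype)

try:
    df["A"].apply(lambda x: len(x))
    error_message = None
except TypeError as exc:
    # len() is not defined for float values.
    error_message = str(exc)

print(error_message)
```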
I believe there should be a more efficient and elegant way to do this.
Based on the answers and comments below, the simplest solutions I found are:
df=df[df.A.apply(lambda x: len(str(x))==10)]
df=df[df.B.apply(lambda x: len(str(x))==10)]
or
df=df[(df.A.apply(lambda x: len(str(x))==10)) & (df.B.apply(lambda x: len(str(x))==10))]
or
df=df[(df.A.astype(str).str.len()==10) & (df.B.astype(str).str.len()==10)]
Accepted answer by unutbu
import pandas as pd
df = pd.read_csv('filex.csv')
df['A'] = df['A'].astype('str')
df['B'] = df['B'].astype('str')
mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10)
df = df.loc[mask]
print(df)
Applied to filex.csv:
A,B
123,abc
1234,abcd
1234567890,abcdefghij
the code above prints
            A           B
2  1234567890  abcdefghij
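Another way to sidestep the numeric conversion is to ask read_csv for strings up front with dtype=str. This is only a sketch: the in-memory text stands in for filex.csv, and the last row (with a missing A) is added here for illustration. Missing values stay missing, .str.len() returns NaN for them, and the comparison with 10 is False, so those rows are dropped.

```python
import io

import pandas as pd

# Stand-in for filex.csv, plus one extra row with a missing A (illustrative only).
csv_text = "A,B\n123,abc\n1234,abcd\n1234567890,abcdefghij\n,abcdefghij\n"

# dtype=str keeps every column as text, so no value is turned into a number.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Rows with a missing value compare as False and are filtered out.
mask = (df["A"].str.len() == 10) & (df["B"].str.len() == 10)
filtered = df.loc[mask]
print(filtered)
```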
Answer by przemo_li
If you have numbers in the rows, they will be converted to floats.
Convert all the rows to strings after importing from CSV. For better performance, split the lambdas across multiple threads.
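One thing to watch when converting after the import: if a column has already been parsed as float, the text form includes the decimal part, so the length is no longer the number of digits. A small sketch with made-up values:

```python
import pandas as pd

# Values as they would look after being parsed as floats (illustrative data).
as_float = pd.Series([1234567890.0, 123.0])

# Converting to text keeps the trailing ".0", which changes the length.
converted = as_float.astype(str)
lengths = converted.str.len()

print(converted.tolist())
print(lengths.tolist())
```

So a ten-digit number that was read as a float would not pass a length == 10 check after this conversion.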
Answer by Mahdi Ghelichi
A more Pythonic way of filtering rows based on conditions on other columns and their values:
Assuming a df of:
data={"names":["Alice","Zac","Anna","O"],"cars":["Civic","BMW","Mitsubishi","Benz"],
"age":["1","4","2","0"]}
df=pd.DataFrame(data)
df:
   age        cars  names
0    1       Civic  Alice
1    4         BMW    Zac
2    2  Mitsubishi   Anna
3    0        Benz      O
Then:
df[
df['names'].apply(lambda x: len(x)>1) &
df['cars'].apply(lambda x: "i" in x) &
df['age'].apply(lambda x: int(x)<2)
]
We will have:
   age   cars  names
0    1  Civic  Alice
In the conditions above, we first check the length of the strings in names, then check whether the letter "i" appears in the strings in cars, and finally check the integer value of the strings in the age column.
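The same three conditions can also be written with pandas' vectorized string methods instead of apply with lambdas. This is a sketch of an alternative, not the answerer's code; it reuses the example data from above.

```python
import pandas as pd

data = {
    "names": ["Alice", "Zac", "Anna", "O"],
    "cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
    "age": ["1", "4", "2", "0"],
}
df = pd.DataFrame(data)

mask = (
    (df["names"].str.len() > 1)                 # name longer than one character
    & df["cars"].str.contains("i", regex=False)  # literal "i" somewhere in the car name
    & (df["age"].astype(int) < 2)               # age stored as text, compared as integer
)
result = df[mask]
print(result)
```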
Answer by Vishal Suryavanshi
You can use apply(len) on a string column, for example df['A'].apply(len). It will give you the length of each value.
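Note that it matters what apply(len) is called on. On a whole DataFrame, len is called once per column and returns the number of rows. On a single column, it is called once per value. A small sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["123", "1234", "1234567890"],
        "B": ["abc", "abcd", "abcdefghij"],
    }
)

# On the DataFrame: one result per column, each equal to the row count.
per_column = df.apply(len)

# On one column: one result per row, the length of each string.
per_value = df["A"].apply(len)

print(per_column)
print(per_value)
```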
Answer by spongebob
I personally found this way to be the easiest:
df = df[df['column_name'].str.len() == 10]
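If several columns need the same check, the mask can be built for all of them at once. The helper name keep_rows_with_length below is hypothetical (it is not part of pandas), and the sketch assumes the listed columns already hold strings.

```python
import pandas as pd


def keep_rows_with_length(df, columns, length):
    """Hypothetical helper: keep rows where every listed column has the given string length."""
    # One boolean column per input column, then require all of them to be True.
    mask = df[columns].apply(lambda col: col.str.len() == length).all(axis=1)
    return df[mask]


df = pd.DataFrame(
    {
        "A": ["123", "1234", "1234567890"],
        "B": ["abc", "abcd", "abcdefghij"],
    }
)
result = keep_rows_with_length(df, ["A", "B"], 10)
print(result)
```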