Python: find the length of the longest string in a Pandas DataFrame column
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/21295334/
Find length of longest string in Pandas dataframe column
Asked by ebressert
Is there a faster way to find the length of the longest string in a Pandas DataFrame than what's shown in the example below?
import numpy as np
import pandas as pd
x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, 10**7)  # repeat count must be an integer, not the float 1e7
df = pd.DataFrame(x, columns=['col1'])
print(df.col1.map(lambda x: len(x)).max())
# result --> 6
It takes about 10 seconds to run df.col1.map(lambda x: len(x)).max() when timing it with IPython's %timeit.
Accepted answer by Marius
DSM's suggestion seems to be about the best you're going to get without doing some manual microoptimization:
%timeit -n 100 df.col1.str.len().max()
100 loops, best of 3: 11.7 ms per loop
%timeit -n 100 df.col1.map(lambda x: len(x)).max()
100 loops, best of 3: 16.4 ms per loop
%timeit -n 100 df.col1.map(len).max()
100 loops, best of 3: 10.1 ms per loop
Note that explicitly using the str.len() method doesn't seem to be much of an improvement over map(len). If you're not familiar with IPython, which is where that very convenient %timeit syntax comes from, I'd definitely suggest giving it a shot for quick testing of things like this.
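For readers without IPython, a rough equivalent of the comparison above can be run with the standard-library timeit module. This is only a minimal sketch: the reduced repeat count (10,000) and the names candidates and results are assumptions chosen for illustration, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller sample than in the question so the script finishes quickly (assumed size).
x = np.repeat(['ab', 'bcd', 'dfe', 'efghik'], 10_000)
df = pd.DataFrame(x, columns=['col1'])

# The three variants timed in the answer, wrapped as zero-argument callables.
candidates = {
    'str.len': lambda: df.col1.str.len().max(),
    'map(lambda)': lambda: df.col1.map(lambda s: len(s)).max(),
    'map(len)': lambda: df.col1.map(len).max(),
}

results = {}
for name, func in candidates.items():
    results[name] = func()
    # Best of 3 runs, 10 calls per run, reported per call.
    seconds = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print('%-12s max=%d  %.3f ms per call' % (name, results[name], seconds * 1000))
```

All three variants should report the same maximum; only the timings are expected to vary.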
Answer by Ricky McMaster
Just as a minor addition, you might want to loop through all object columns in a data frame:
for c in df:
    if df[c].dtype == 'object':
        print('Max length of column %s: %s\n' % (c, df[c].map(len).max()))
This will prevent errors being thrown by bool, int types etc.
This could be expanded to other non-numeric types such as 'string_' and 'unicode_', i.e.
if df[c].dtype in ('object', 'string_', 'unicode_'):
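As a possible variation on the loop above, pandas' DataFrame.select_dtypes can do the column filtering. This is a sketch rather than the answer's own method: the helper name max_len_object_columns and the sample data are hypothetical, the columns are built with an explicit object dtype, and it assumes the object columns hold only strings (a missing value would make len fail).

```python
import pandas as pd


def max_len_object_columns(df):
    """Return {column: longest string length} for object-dtype columns only."""
    object_cols = df.select_dtypes(include='object').columns
    return {c: int(df[c].map(len).max()) for c in object_cols}


# Hypothetical sample frame mixing string, integer and boolean columns.
sample = pd.DataFrame({
    'name': pd.Series(['ab', 'efghik'], dtype=object),
    'count': [1, 2],
    'flag': [True, False],
    'city': pd.Series(['dfe', 'bc'], dtype=object),
})

lengths = max_len_object_columns(sample)
for column, length in lengths.items():
    print('Max length of column %s: %s' % (column, length))
```

The integer and boolean columns are skipped, so len is never called on a non-string value.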
Answer by Acumenus
Sometimes you want the length of the longest string in bytes. This is relevant for strings that contain non-ASCII Unicode characters, which take more than one byte each in UTF-8, so the length in bytes is greater than the regular length. This can be very relevant in specific situations, e.g. for database writes.
df_col_len = int(df[df_col_name].str.encode(encoding='utf-8').str.len().max())
The above line has the extra str.encode(encoding='utf-8'). The output is wrapped in int() because it is otherwise a NumPy integer rather than a plain Python int.
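To illustrate the difference between character length and byte length, here is a small sketch with made-up sample strings (the Series s and its contents are assumptions for demonstration only).

```python
import pandas as pd

# 'na\u00efve' contains one 2-byte character; each character of '日本語' is 3 bytes in UTF-8.
s = pd.Series(['abc', 'na\u00efve', '日本語'])

max_chars = int(s.str.len().max())                       # longest string in characters
max_bytes = int(s.str.encode('utf-8').str.len().max())   # longest string in UTF-8 bytes

print(max_chars, max_bytes)  # prints: 5 9
```

Here the longest string by character count and the longest by byte count are different rows, which is why the two measures need to be computed separately.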
Answer by Azhar Ansari
You should try using numpy. This could also help you get efficiency improvements.
The code below will give you the maximum length for each column in an Excel spreadsheet (read into a dataframe using pandas):
import pandas as pd
import numpy as np
xl = pd.ExcelFile('sample.xlsx')
df = xl.parse('Sheet1')
columnLengths = np.vectorize(len)
maxColumnLengths = columnLengths(df.values.astype(str)).max(axis=0)
print('Max Column Lengths ', maxColumnLengths)
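The same approach can be tried without an Excel file. The sketch below swaps the spreadsheet for a small in-memory frame (the column names and values are assumptions for illustration). Note that astype(str) converts every cell to text first, so numbers are measured by their printed form and a missing value would be counted as the 3-character string 'nan'.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed spreadsheet.
df = pd.DataFrame({'name': ['ab', 'bcd'], 'value': [1, 12345]})

# Apply len to every cell after converting the whole table to strings.
element_len = np.vectorize(len)
max_lengths = element_len(df.values.astype(str)).max(axis=0)

print(dict(zip(df.columns, max_lengths.tolist())))  # prints: {'name': 3, 'value': 5}
```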
Answer by jabberwocky
Excellent answers, in particular those from Marius and Ricky, which were very helpful.
Given that most of us are optimising for coding time, here is a quick extension to those answers to return all the columns' max item length as a series, sorted by the maximum item length per column:
mx_dct = {c: df[c].map(lambda x: len(str(x))).max() for c in df.columns}
pd.Series(mx_dct).sort_values(ascending=False)
Or as a one-liner:
pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}).sort_values(ascending=False)
Adapting the original sample, this can be demoed as:
import pandas as pd
x = [['ab', 'bcd'], ['dfe', 'efghik']]
df = pd.DataFrame(x, columns=['col1','col2'])
print(pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}).sort_values(ascending=False))
Output:
col2 6
col1 3
dtype: int64
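A possible alternative to the dictionary comprehension, offered only as a sketch and not as part of the original answer, is to convert the whole frame to strings once and let DataFrame.apply run the vectorized str.len on each column. The variable name max_lens is an assumption.

```python
import pandas as pd

x = [['ab', 'bcd'], ['dfe', 'efghik']]
df = pd.DataFrame(x, columns=['col1', 'col2'])

# Convert every cell to text, take the longest length per column, then sort.
max_lens = (
    df.astype(str)
      .apply(lambda col: col.str.len().max())
      .sort_values(ascending=False)
)
print(max_lens)
```

For this demo frame the ordering matches the output shown above: col2 first with 6, then col1 with 3.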


