Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/50339065/

Time: 2020-09-14 05:33:49. Source: igfitidea

How to get maximum length of each column in the data frame using pandas python

Tags: python, python-3.x, pandas, dataframe, series

Asked by singularity2047

I have a data frame where most of the columns are varchar/object type. The length of the values varies a lot from column to column and could be anything in the range of 3 to 1000+. Now, for each column, I want to measure the maximum length.

I know how to calculate the maximum length for a single column. If it's varchar then:

max(df.char_col.apply(len))

and if it's a number (float8 or int64) then:

max(df.num_col.map(str).apply(len))

But my dataframe has hundreds of columns and I want to calculate the maximum length for all columns at the same time. The problem is that there are different data types, and I don't know how to handle them all at once.

So Question 1: How to get the maximum length for each column in the data frame

Now I am trying to do that only for varchar/object type columns using the following code:

xx = df.select_dtypes(include = ['object'])
for col in [xx.columns.values]:
   maxlength = [max(xx.col.apply(len))]

I selected only the object type columns and tried to write a for loop, but it's not working. Probably using apply() within a for loop is not a good idea.

Question 2: How to get the maximum length of each column for only the object type columns

Sample data frame:

d1 = {'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA'], 'comment1': ['Very good performance', 'N/A', 'Need to work hard', 'No Comment', 'Not satisfactory'], 'comment2': ['good', 'Meets Expectation', 'N', 'N/A', 'Incompetence']}
df1 = pd.DataFrame(data = d1)
df1['month'] = pd.DatetimeIndex(df1['DoB']).month
df1['year'] = pd.DatetimeIndex(df1['DoB']).year

Answered by jpp

One solution is to use numpy.vectorize, which may be more efficient than pandas-based solutions.

You can use pd.DataFrame.select_dtypes to select object columns.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['abc', 'de', 'abcd'],
                   'B': ['a', 'abcde', 'abc'],
                   'C': [1, 2.5, 1.5]})

measurer = np.vectorize(len)

Max length for all columns:

res1 = measurer(df.values.astype(str)).max(axis=0)

array([4, 5, 3])

Max length for object columns:

res2 = measurer(df.select_dtypes(include=[object]).values.astype(str)).max(axis=0)

array([4, 5])


Or, if you need the output as a dictionary:

res1 = dict(zip(df, measurer(df.values.astype(str)).max(axis=0)))

{'A': 4, 'B': 5, 'C': 3}

df_object = df.select_dtypes(include=[object])
res2 = dict(zip(df_object, measurer(df_object.values.astype(str)).max(axis=0)))

{'A': 4, 'B': 5}
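
As a side note, the NumPy documentation describes np.vectorize as a convenience wrapper that is essentially a for loop. Once the values have been converted with astype(str), NumPy's own string routine np.char.str_len can measure the lengths directly. The following is only a sketch of that variation on the same sample frame, not part of the original answer; the names as_text and res are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['abc', 'de', 'abcd'],
                   'B': ['a', 'abcde', 'abc'],
                   'C': [1, 2.5, 1.5]})

# Convert every cell to text first, exactly as in the answer above.
as_text = df.values.astype(str)

# np.char.str_len returns the length of each string element;
# max(axis=0) then reduces down the rows, giving one value per column.
res = np.char.str_len(as_text).max(axis=0)

print(dict(zip(df.columns, res.tolist())))
# {'A': 4, 'B': 5, 'C': 3}
```

Like the original, this still builds a full string copy of the data, so it does not avoid the memory cost on very large frames.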

Answered by alif

Some great answers here, and I would like to contribute mine.

Solution:

dict([(v, df[v].apply(lambda r: len(str(r)) if r is not None else 0).max()) for v in df.columns.values])

Explanation:

# convert the list of tuples to a dictionary
dict(
    [
        # create a tuple of (column name, max length of the values in that column)
        (v, df[v].apply(lambda r: len(str(r)) if r is not None else 0).max())
            for v in df.columns.values  # iterate over all column names
    ])

Sample output:

{'name': 4, 'DoB': 10, 'Address': 2, 'comment1': 21, 'comment2': 17}
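
One caveat with the None check: pandas usually stores missing values as NaN, and str(NaN) is 'nan', which counts as length 3. If missing values should count as length 0, pd.isna covers both None and NaN. The sketch below assumes that behaviour is what you want; the small frame and the helper max_len are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
df = pd.DataFrame({
    "name": ["john", "tom", None],
    "score": [1.5, np.nan, 10.25],
})

def max_len(column: pd.Series) -> int:
    """Longest string form in the column, counting missing values as 0."""
    return int(column.map(lambda v: 0 if pd.isna(v) else len(str(v))).max())

lengths = {name: max_len(df[name]) for name in df.columns}
print(lengths)
# {'name': 4, 'score': 5}
```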

Answered by Osmond Bishop

Select only the object type columns:

df2 = df1[[x for x in df1 if df1[x].dtype == 'O']]

Get the maximum length in each column:

max_length_in_each_col = df2.applymap(lambda x: len(x)).max()
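
Note that DataFrame.applymap is deprecated in recent pandas releases (2.1 and later, as far as I know) in favour of DataFrame.map. A column-wise apply with the .str accessor gives the same result and should work on older and newer versions alike. This is a sketch on a trimmed copy of the question's sample data; picking text columns with pd.api.types.is_string_dtype is my own choice here, not something from the answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["john", "tom", "bob", "rock", "jimy"],
    "Address": ["NY", "NJ", "PA", "NY", "CA"],
    "comment1": ["Very good performance", "N/A", "Need to work hard",
                 "No Comment", "Not satisfactory"],
    "year": [2010, 2012, 2014, 2014, 2016],
})

# Keep only the columns that hold text; the integer "year" column is skipped.
text_cols = [c for c in df1.columns if pd.api.types.is_string_dtype(df1[c])]
df2 = df1[text_cols]

# For each column, measure every string and take the largest length.
max_length_in_each_col = df2.apply(lambda col: col.str.len().max())

print(max_length_in_each_col.to_dict())
# {'name': 4, 'Address': 2, 'comment1': 21}
```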

Answered by Azhar Ansari

I tried numpy.vectorize, but it gave a 'Memory Error' for a huge dataframe.

The code below worked perfectly for me. It gives you a list of the maximum lengths for each column in an Excel spreadsheet (read into a dataframe using pandas).

import pandas as pd

xl = pd.ExcelFile('sample.xlsx')
df = xl.parse('Sheet1')

maxColumnLenghts = []
for col in range(len(df.columns)):
    maxColumnLenghts.append(max(df.iloc[:,col].astype(str).apply(len)))
print('Max Column Lengths ', maxColumnLenghts)
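
The loop in the question can be repaired along the same lines. It failed for two reasons: `for col in [xx.columns.values]` wraps the whole array of names in a list, so the loop runs once with col bound to that array, and `xx.col` looks for a column literally named "col" instead of using the loop variable. Here is a minimal sketch of a corrected loop, using a small in-memory frame instead of sample.xlsx (the frame and the name max_lengths are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "tom", "bob"],
    "comment": ["good", "Meets Expectation", "N/A"],
    "year": [2010, 2012, 2014],
})

max_lengths = {}
for col in df.columns:  # iterate over the column labels themselves
    # df[col] uses the value of the loop variable; df.col would not.
    max_lengths[col] = int(df[col].astype(str).str.len().max())

print(max_lengths)
# {'name': 4, 'comment': 17, 'year': 4}
```

Storing the results in a dictionary keeps each length attached to its column name, which is easier to read than a bare list when there are hundreds of columns.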

Answered by MSallal

You can use min and max after applying the str accessor and its len method:

df["A"].str.len().max()
df["A"].str.len().min()

df["Column Name"].str.len().max()
df["Column Name"].str.len().min()