Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/54009400/

Date: 2020-09-14 06:14:53  Source: igfitidea

Shuffle one column in pandas dataframe

Tags: python, pandas, numpy

Asked by Arlo Guthrie

How does one shuffle only one column of data in pandas?

I have a DataFrame with production data that I want to load onto dev for testing. However, the data contains personally identifiable information, so I want to shuffle those columns.

Columns: FirstName LastName Birthdate SSN OtherData

If the original dataframe is created by read_csv and I want to translate the data into a second dataframe for sql loading but shuffle first name, last name, and SSN, I would have expected to be able to do this:

if devprod == 'prod':
    #do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.shuffle(df[4])
    df1['HS_LAST_NAME'] = np.random.shuffle(df[6])
    df1['HS_SSN'] = np.random.shuffle(df[8])

However, when I try that I get the following error:

A value is trying to be set on a copy of a slice from a DataFrame

Accepted answer by jpp

The immediate error is a symptom of using an inadvisable approach when working with dataframes.

np.random.shuffle works in-place and returns None, so assigning the output of np.random.shuffle to a column will not work. In fact, in-place operations are rarely required, and often yield no material benefits.
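
To make the in-place behaviour concrete, here is a minimal sketch using a small made-up array (the names `values` and `result` are illustrative and not from the question):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

# shuffle reorders `values` itself and hands back None,
# so `result` is not a shuffled array
result = np.random.shuffle(values)

print(result)  # None
print(values)  # same five elements, in a random order
```

Assigning `result` to a DataFrame column would therefore store None rather than shuffled data.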

Here, for example, you can use np.random.permutation and use NumPy arrays via pd.Series.values rather than series:

if devprod == 'prod':
    #do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.permutation(df[4].values)
    df1['HS_LAST_NAME'] = np.random.permutation(df[6].values)
    df1['HS_SSN'] = np.random.permutation(df[8].values)
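
The snippet above depends on df, df1, and devprod from the question. As a self-contained sketch of the same idea, the following uses a small hypothetical DataFrame; the column names, the sample values, and the seeded generator are assumptions made for illustration, and `to_numpy()` is used here in place of `.values`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names are illustrative only
df = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Cy", "Dee"],
    "ssn": ["111", "222", "333", "444"],
    "other": [10, 20, 30, 40],
})

# Seeded only so the sketch is repeatable; drop the seed for real scrambling
rng = np.random.default_rng(seed=0)

scrambled = df.copy()
for col in ["first_name", "ssn"]:
    # permutation returns a new shuffled array and leaves df untouched;
    # assigning a plain NumPy array sets the values by position
    scrambled[col] = rng.permutation(df[col].to_numpy())

print(scrambled)
```

Each column is permuted independently, so a first name and an SSN from the same original row will generally no longer sit together, while the untouched `other` column keeps its original order.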