Original source: http://stackoverflow.com/questions/54009400/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Shuffle one column in pandas dataframe
Asked by Arlo Guthrie
How does one shuffle only one column of data in pandas?
I have a DataFrame with production data that I want to load onto dev for testing. However, the data contains personally identifiable information, so I want to shuffle those columns.
Columns: FirstName, LastName, Birthdate, SSN, OtherData
If the original DataFrame is created by read_csv and I want to copy the data into a second DataFrame for SQL loading, but with first name, last name, and SSN shuffled, I would have expected to be able to do this:
if devprod == 'prod':
    # do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.shuffle(df[4])
    df1['HS_LAST_NAME'] = np.random.shuffle(df[6])
    df1['HS_SSN'] = np.random.shuffle(df[8])
However, when I try that I get the following error:
A value is trying to be set on a copy of a slice from a DataFrame
Accepted answer by jpp
The immediate error is a symptom of using an inadvisable approach when working with dataframes.
np.random.shuffle works in-place and returns None, so assigning the output of np.random.shuffle will not work. In fact, in-place operations are rarely required, and often yield no material benefits.
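To see the difference on its own, here is a minimal sketch using a toy array (the `values` array is made up for illustration and is not part of the original question). It shows that np.random.permutation returns a shuffled copy and leaves its input alone, while np.random.shuffle reorders its argument in place and returns None.

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50])  # hypothetical sample data
original = values.copy()

# permutation returns a new shuffled array; the input is left untouched
shuffled_copy = np.random.permutation(values)
untouched = np.array_equal(values, original)

# shuffle reorders the array in place and returns None
result = np.random.shuffle(values)

print(untouched)  # True
print(result)     # None
```

Assigning `result` to a column is what fills it with missing values in the question's code.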
Here, for example, you can use np.random.permutation and work with NumPy arrays via pd.Series.values rather than series:
if devprod == 'prod':
    # do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.permutation(df[4].values)
    df1['HS_LAST_NAME'] = np.random.permutation(df[6].values)
    df1['HS_SSN'] = np.random.permutation(df[8].values)
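The snippet above depends on `df`, `df1`, and `devprod` from the question. As a self-contained sketch of the same idea, the example below builds a small made-up DataFrame using the column names from the question and shuffles the sensitive columns. The helper name `shuffle_columns`, the `seed` parameter, and the sample rows are all assumptions for illustration, not part of the original answer; it also uses NumPy's `default_rng` generator in place of the legacy `np.random.permutation` call.

```python
import numpy as np
import pandas as pd


def shuffle_columns(df, columns, seed=None):
    """Return a copy of df with each listed column independently permuted.

    Hypothetical helper, not from the original answer.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in columns:
        # Assign a plain NumPy array: assigning a shuffled Series would be
        # realigned on the index and the shuffle would be undone.
        out[col] = rng.permutation(out[col].to_numpy())
    return out


# Made-up sample data standing in for the production extract
prod = pd.DataFrame({
    "FirstName": ["Ann", "Bob", "Cid", "Dee"],
    "LastName": ["Ames", "Byrd", "Cole", "Dunn"],
    "SSN": ["001", "002", "003", "004"],
    "OtherData": [1, 2, 3, 4],
})

dev = shuffle_columns(prod, ["FirstName", "LastName", "SSN"], seed=0)
print(dev)
```

Each column is permuted separately, so the pairing between a first name, last name, and SSN is broken, while columns that are not listed keep their original order. The original values are still present in the output, only attached to different rows.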

