Shuffle one column in pandas dataframe
Original question: http://stackoverflow.com/questions/54009400/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Arlo Guthrie
How does one shuffle only one column of data in pandas?
I have a DataFrame with production data that I want to load onto dev for testing. However, the data contains personally identifiable information, so I want to shuffle those columns.
Columns: FirstName, LastName, Birthdate, SSN, OtherData
If the original DataFrame is created by read_csv and I want to copy the data into a second DataFrame for SQL loading, but with first name, last name, and SSN shuffled, I would have expected to be able to do this:
if devprod == 'prod':
    # do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.shuffle(df[4])
    df1['HS_LAST_NAME'] = np.random.shuffle(df[6])
    df1['HS_SSN'] = np.random.shuffle(df[8])
However, when I try that I get the following error:
A value is trying to be set on a copy of a slice from a DataFrame
Accepted answer by jpp
The immediate error is a symptom of using an inadvisable approach when working with dataframes.
np.random.shuffle works in place and returns None, so assigning its return value to a column will not work. In fact, in-place operations are rarely required and often yield no material benefits.
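As a minimal sketch of that behaviour (the array below is made-up sample data, not the asker's), calling np.random.shuffle reorders its argument and hands back nothing:

```python
import numpy as np

names = np.array(["Ann", "Bob", "Cy", "Dee"])  # hypothetical sample data
before = names.copy()

result = np.random.shuffle(names)  # reorders `names` itself

print(result)  # None: there is nothing useful to assign
print(sorted(names) == sorted(before))  # True: same values, possibly in a new order
```

Assigning `result` to a DataFrame column would therefore store None rather than shuffled names.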
Here, for example, you can use np.random.permutation and use NumPy arrays via pd.Series.values rather than series:
import numpy as np

if devprod == 'prod':
    # do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    # permutation returns a new shuffled array and leaves df untouched
    df1['HS_FIRST_NAME'] = np.random.permutation(df[4].values)
    df1['HS_LAST_NAME'] = np.random.permutation(df[6].values)
    df1['HS_SSN'] = np.random.permutation(df[8].values)
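The same idea can be tried end to end with a small self-contained sketch. The frame, its column names, and the helper shuffle_columns below are hypothetical stand-ins for illustration, not part of the original question or answer:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame returned by read_csv
df = pd.DataFrame({
    "FirstName": ["Ann", "Bob", "Cy", "Dee"],
    "LastName": ["Ames", "Bell", "Cole", "Dunn"],
    "SSN": ["111", "222", "333", "444"],
    "OtherData": [10, 20, 30, 40],
})

def shuffle_columns(frame, columns):
    """Return a copy of frame with each listed column shuffled independently."""
    out = frame.copy()
    for col in columns:
        # A plain NumPy array has no index, so it is assigned by position
        out[col] = np.random.permutation(frame[col].values)
    return out

dev = shuffle_columns(df, ["FirstName", "LastName", "SSN"])

# Contrast: a reordered Series is realigned on its index when assigned,
# so the column ends up back in its original order
realigned = df.copy()
realigned["SSN"] = df["SSN"].sample(frac=1)
print(realigned["SSN"].equals(df["SSN"]))  # True
```

Working with arrays rather than Series sidesteps this index alignment, which is one reason a shuffled Series can appear not to shuffle at all when assigned to a column.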