Multiple values in single column of a pandas DataFrame
Original question: http://stackoverflow.com/questions/26792852/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by StFS
I have some data that I'm parsing from XML to a pandas DataFrame. The XML data roughly looks like this:
<tracks>
  <track name="trackname1" variants="1,2,3,4,5">
    <variant var="1,2,3">
      <leg time="21:23" route_id="5" stop_id="103" serial="1"/>
      <leg time="21:26" route_id="5" stop_id="17" serial="2"/>
      <leg time="21:30" route_id="5" stop_id="38" serial="3"/>
      <leg time="20:57" route_id="8" stop_id="101" serial="1"/>
      <leg time="21:01" route_id="8" stop_id="59" serial="2"/>
      ...
    </variant>
    <variant var="4,5">
      ... more leg elements
    </variant>
  </track>
  <track name="trackname2" variants="1,2,3,4,5,6,7">
    <variant var="1">
      ... more leg elements
    </variant>
    <variant var="2,3,4,5,7">
      ... more leg elements
    </variant>
  </track>
</tracks>
I'm importing this into pandas because I need to be able to join this data with other DataFrames and I need to be able to query for stuff like: "get all legs of variant 1 for route_id 5".
I'm trying to figure out how I would do this in a pandas DataFrame. Should I make a DataFrame that would look something like this:
track_name     variants  time     route_id  stop_id  serial
"trackname1"   "1,2,3"   "21:23"  "5"       "103"    "1"
"trackname1"   "1,2,3"   "21:26"  "5"       "17"     "2"
...
"trackname1"   "4,5"     "21:20"  "5"       "103"    "1"
...
"trackname2"   "1"       "20:59"  "3"       "45"     "1"
... you get the point
If this is the way to go, how would I (efficiently) extract, for example, "all rows for variant 3 on route_id 5"? Note that this should give me all the rows that have 3 in the variants column list, not just the rows that only have "3" in the variants column.
Is there a different way of constructing the DataFrame that would make this easier? Should I be using something other than pandas?
Answered by unutbu
Assuming you have enough memory, your task would be more easily accomplished if your DataFrame held one variant per row:
track_name     variants  time     route_id  stop_id  serial
"trackname1"   1         "21:23"         5      103       1
"trackname1"   2         "21:23"         5      103       1
"trackname1"   3         "21:23"         5      103       1
"trackname1"   1         "21:26"         5       17       2
"trackname1"   2         "21:26"         5       17       2
"trackname1"   3         "21:26"         5       17       2
...
"trackname1"   4         "21:20"         5      103       1
"trackname1"   5         "21:20"         5      103       1
...
"trackname2"   1         "20:59"         3       45       1
Then you could find "all rows for variant 3 on route_id 5" with
df.loc[(df['variants']==3) & (df['route_id']==5)]
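If the data has already been loaded in the packed layout from the question, one possible way to reach the one-variant-per-row layout is to split the variants string and explode it. This is only a sketch: the sample rows are made up to mirror the question's table, and it assumes a pandas version that provides DataFrame.explode with the ignore_index parameter (1.1 or later).

```python
import pandas as pd

# Hypothetical sample rows in the packed layout shown in the question.
packed = pd.DataFrame({
    "track_name": ["trackname1", "trackname1", "trackname1", "trackname2"],
    "variants": ["1,2,3", "1,2,3", "4,5", "1"],
    "time": ["21:23", "21:26", "21:20", "20:59"],
    "route_id": ["5", "5", "5", "3"],
    "stop_id": ["103", "17", "103", "45"],
    "serial": ["1", "2", "1", "1"],
})

# Turn "1,2,3" into ["1", "2", "3"], then emit one row per list element.
df = packed.assign(variants=packed["variants"].str.split(",")).explode(
    "variants", ignore_index=True
)

# Store the numeric columns as integers so comparisons are vectorized.
df = df.astype({"variants": int, "route_id": int, "stop_id": int, "serial": int})

# The query from the answer now works unchanged.
result = df.loc[(df["variants"] == 3) & (df["route_id"] == 5)]
print(result)
```

The cost is extra memory: each leg row is repeated once per variant it belongs to.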
If you pack many variants into one row, such as
"trackname1"   "1,2,3"   "21:23"  "5"       "103"    "1"
then you could find such rows using
df.loc[(df['variants'].str.contains("3")) & (df['route_id']=="5")]
assuming that the variants are always single digits. If there are also two-digit variants like "13" or "30", then you would need to pass a more complicated regex pattern to str.contains.
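As an illustration of what such a pattern might look like, the sketch below anchors the "3" between commas or the ends of the string, so that "13" and "30" do not match. The pattern and the sample rows are assumptions for demonstration, not part of the original answer.

```python
import pandas as pd

# Hypothetical packed rows, including two-digit variants.
df = pd.DataFrame({
    "variants": ["1,2,3", "13,30", "3", "4,5", "2,3,13"],
    "route_id": ["5", "5", "5", "5", "8"],
})

# Match "3" only when it is a whole comma-delimited item:
# preceded by start-of-string or a comma, followed by a comma or end-of-string.
# Non-capturing groups keep str.contains from warning about match groups.
pattern = r"(?:^|,)3(?:,|$)"

mask = df["variants"].str.contains(pattern) & (df["route_id"] == "5")
matches = df.loc[mask]

# For comparison, the plain substring test also picks up "13,30".
naive = df.loc[df["variants"].str.contains("3") & (df["route_id"] == "5")]
print(matches)
print(naive)
```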
Alternatively, you could use apply to split each variant on commas:
df['variants'].apply(lambda x: "3" in x.split(','))
but this is very inefficient, since you would now be calling a Python function once for every row, splitting a string and testing for membership in a list each time, instead of doing a single vectorized integer comparison.
Thus, to avoid possibly complicated regex or a relatively slow call to apply, I think your best bet is to build the DataFrame with one integer variant per row.
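A minimal sketch of building such a DataFrame straight from the XML, using the standard library's xml.etree.ElementTree. The embedded XML is a shortened, made-up sample in the shape shown in the question (the elided leg elements are not reconstructed), and the column names follow the table above.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened sample in the shape of the question's XML (illustrative data only).
xml_text = """
<tracks>
  <track name="trackname1" variants="1,2,3,4,5">
    <variant var="1,2,3">
      <leg time="21:23" route_id="5" stop_id="103" serial="1"/>
      <leg time="21:26" route_id="5" stop_id="17" serial="2"/>
    </variant>
    <variant var="4,5">
      <leg time="21:20" route_id="5" stop_id="103" serial="1"/>
    </variant>
  </track>
</tracks>
"""

rows = []
root = ET.fromstring(xml_text)
for track in root.iter("track"):
    for variant in track.iter("variant"):
        # "1,2,3" -> [1, 2, 3]
        variant_ids = [int(v) for v in variant.get("var").split(",")]
        for leg in variant.iter("leg"):
            # Repeat the leg once per variant it belongs to.
            for variant_id in variant_ids:
                rows.append({
                    "track_name": track.get("name"),
                    "variants": variant_id,
                    "time": leg.get("time"),
                    "route_id": int(leg.get("route_id")),
                    "stop_id": int(leg.get("stop_id")),
                    "serial": int(leg.get("serial")),
                })

df = pd.DataFrame(rows)

# "All rows for variant 3 on route_id 5"
legs = df.loc[(df["variants"] == 3) & (df["route_id"] == 5)]
print(legs)
```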

