Problems with Pandas

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/29305131/

Time: 2020-09-13 23:07:13  Source: igfitidea

python, csv, pandas

Asked by kidman01

Sorry for the vague title, but I don't really know what the problem is. I want to load a CSV file, split it into two arrays, and perform a function on each of them. It works for the first array, but the second one causes problems even though everything is the same. I'm really stuck. The code is as follows:

from wordutility import wordutility
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import cross_validation
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
import pandas as pd
import numpy as np

data = pd.read_csv('sts_gold_tweet.csv', header=None, delimiter=';',
               quotechar='"')

# test = pd.read_csv('output.csv', header=None,
#                   delimiter=';', quotechar='"')

split_ratio = 0.9
train = data[:round(len(data)*split_ratio)]
test = data[round(len(data)*split_ratio):]

y = data[1]

print("Cleaning and parsing tweets data...\n")

traindata = []

for i in range(0, len(train[0])):
     traindata.append(" ".join(wordutility.tweet_to_wordlist
                          (train[0][i], False)))

testdata = []

for i in range(0, len(test[0])):
    testdata.append(" ".join(wordutility.tweet_to_wordlist(test[0][i], False)))

The program works up until the very last line. The error is:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.4/site-packages/pandas/core/series.py", line 509, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/lib/python3.4/site-packages/pandas/core/index.py", line 1417, in get_value
    return self._engine.get_value(s, k)
  File "pandas/index.pyx", line 100, in pandas.index.IndexEngine.get_value (pandas/index.c:3097)
  File "pandas/index.pyx", line 108, in pandas.index.IndexEngine.get_value (pandas/index.c:2826)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3692)
  File "pandas/hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7201)
  File "pandas/hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7139)
KeyError: 0

(The traceback says line 2 because I was trying the code in the Python shell, so line 2 refers to the last line of the code above.)

Hopefully someone can help me :). Thanks

EDIT

OK, it seems the splitting is not working as I thought it would. I did get two arrays as I wanted, but somehow the row numbers are still as if it were one file. So the array train goes from 0 to 1830 and the array test goes from 1831 to 2034, which means the range was wrong. How would I go about splitting up the CSV file "correctly"?

EDIT 2

>>> print(train[0:5])
                                               0         1
0  the angel is going to miss the athlete this we...  negative 
1  It looks as though Shaq is getting traded to C...  negative
2     @clarianne APRIL 9TH ISN'T COMING SOON ENOUGH   negative
3  drinking a McDonalds coffee and not understand...  negative
4  So dissapointed Taylor Swift doesnt have a Twi...  negative

>>> print(test[0:5])
                                                  0         1
1831  Why is my PSP always dead when I want to use it?   negative
1832  @hillaryrachel oh i know how you feel. i took ...  negative
1833  @daveknox awesome-  corporate housing took awa...  negative
1834  @lakersnation Is this a joke?  I can't find them   negative
1835                              XBox Live still down   negative

So as you can see, the array "test" starts at line number 1831. I would have thought it would start at 0. I have now fixed my problem by editing the range in the for loop:

for i in range(len(train[0]), len(data)):

So my original problem is fixed; I'm just curious and eager to learn to write better code. Is this an OK thing to do, or should I split the CSV file in a different way?

Accepted answer by TheBlackCat

When you do test[0], you are not getting the first element of test by position; it is more like you are getting the column of test with the "name" 0. When you split the pandas DataFrame in two, the original column names were preserved. This means that the test DataFrame has no column 0, since that column is in the first DataFrame.

Let me give you an example. Say you have the following DataFrame:

       0   1   2   3   4   5   6   7   8   9
Ind1   0   1   2   3   4   5   6   7   8   9
Ind2  10  11  12  13  14  15  16  17  18  19

When you split it, you end up with these DataFrames:

       0   1   2   3   4
Ind1   0   1   2   3   4
Ind2  10  11  12  13  14

and:

       5   6   7   8   9
Ind1   5   6   7   8   9
Ind2  15  16  17  18  19

Notice that the columns of the second DataFrame start with 5, not 0, because those were the column names before the split. So when you try to get column 0, it isn't there. That is the source of your error.
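
A minimal sketch of the same label-preservation effect, applied to a row-wise split like the one in the question. The five-row DataFrame is a hypothetical stand-in for the tweet CSV, and the split point 3 is arbitrary. In a row-wise split the preserved labels are the row labels, so test[0] still selects column 0, and it is the following label lookup [0] on that column that fails:

```python
import pandas as pd

# Hypothetical stand-in for the tweet CSV: column 0 is the text, column 1 the label.
data = pd.DataFrame({
    0: ["Tweet A", "Tweet B", "Tweet C", "Tweet D", "Tweet E"],
    1: ["negative"] * 5,
})

split = 3  # arbitrary split point for illustration
train = data[:split]
test = data[split:]

# Slicing keeps the original row labels; it does not renumber from 0.
print(list(train.index))
print(list(test.index))

# test[0] is the column labelled 0; the second [0] looks up the ROW label 0,
# which only exists in train.
try:
    test[0][0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(lookup_failed)
```

This is why the loop over range(0, len(test[0])) in the question breaks on its first iteration, while the same loop over train happens to work.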

The simplest solution would just be to use the positional index, rather than the label. So instead of something like test[0], use test.iloc[0]. That will give the value based on positional index.
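
As a rough sketch of how positional access could look for the question's data: the small DataFrame and the clean helper below are hypothetical stand-ins for the CSV and for wordutility.tweet_to_wordlist. Note that test.iloc[0] returns the whole first row, test[0].iloc[i] picks one cell of column 0 by position, and reset_index(drop=True) renumbers the rows from 0 so that a loop like the original one would work unchanged.

```python
import pandas as pd

def clean(text):
    # Hypothetical placeholder for wordutility.tweet_to_wordlist.
    return text.lower().split()

# Hypothetical stand-in for the tweet CSV.
data = pd.DataFrame({
    0: ["Tweet A", "Tweet B", "Tweet C", "Tweet D", "Tweet E"],
    1: ["negative"] * 5,
})
test = data[3:]  # row labels are 3 and 4

# Option 1: positional access with iloc.
first_row = test.iloc[0]       # whole first row of test, as a Series
first_cell = test[0].iloc[0]   # first value of column 0, by position

# Option 2: renumber the rows so label-based loops start at 0 again.
test_reset = test.reset_index(drop=True)

# Option 3: skip indexing entirely and iterate over the column's values.
testdata = [" ".join(clean(text)) for text in test[0]]

print(first_cell)
print(list(test_reset.index))
print(testdata)
```

Iterating over the column directly (option 3) avoids depending on the row labels at all, which also removes the need to adjust the range by hand as in the question's workaround.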
