pandas: outputting a pandas Series to a txt file
Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/48831802/
Outputting pandas series to txt file
Asked by mrsquid
I have a pandas Series object
<class 'pandas.core.series.Series'>
that looks like this:
userId
1 3072 1196 838 2278 1259
2 648 475 1 151 1035
3 457 150 300 21 339
4 1035 7153 953 4993 2571
5 260 671 1210 2628 7153
6 4993 1210 2291 589 1196
7 150 457 111 246 25
8 1221 8132 30749 44191 1721
9 296 377 2858 3578 3256
10 2762 377 2858 1617 858
11 527 593 2396 318 1258
12 3578 2683 2762 2571 2580
13 7153 150 5952 35836 2028
14 1197 2580 2712 2762 1968
15 1245 1090 1080 2529 1261
16 296 2324 4993 7153 1203
17 1208 1234 6796 55820 1060
18 1377 1 1073 1356 592
19 778 1173 272 3022 909
20 329 534 377 73 272
21 608 904 903 1204 111
22 1221 1136 1258 4973 48516
23 1214 1200 1148 2761 2791
24 593 318 162 480 733
25 314 969 25 85 766
26 293 253 4878 46578 64614
27 1193 2716 24 2959 2841
28 318 260 58559 8961 4226
29 318 260 1196 2959 50
30 1077 1136 1230 1203 3481
642 123 593 750 1212 50
643 750 671 1663 2427 5618
644 780 3114 1584 11 62
645 912 2858 1617 1035 903
646 608 527 21 2710 1704
647 1196 720 5060 2599 594
648 46578 50 745 1223 5995
649 318 300 110 529 246
650 733 110 151 318 364
651 1240 1210 541 589 1247
652 4993 296 95510 122900 736
653 858 1225 1961 25 36
654 333 1221 3039 1610 4011
655 318 47 6377 527 2028
656 527 1193 1073 1265 73
657 527 349 454 357 97
658 457 590 480 589 329
659 474 508 1 288 477
660 904 1197 1247 858 1221
661 780 1527 3 1376 5481
662 110 590 50 593 733
663 2028 919 527 2791 110
664 1201 64839 1228 122886 1203
665 1197 858 7153 1221 6539
666 318 300 161 500 337
667 527 260 318 593 223
668 161 527 151 110 300
669 50 2858 4993 318 2628
670 296 5952 508 272 1196
671 1210 1200 7153 593 110
What is the best way to output this to a txt file (e.g. output.txt) so that the format looks like this?
User-id1 movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
User-id2 movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
The values on the far left are the userIds and the other values are the movieIds.
Here is the code that generated the above:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


def predict(l):
    # finds the userIds corresponding to the top 5 similarities
    # calculate the prediction according to the formula
    return (df[l.index] * l).sum(axis=1) / l.sum()


# use userId as columns for convenience when interpreting the formula
df = pd.read_csv('ratings.csv').pivot(columns='userId',
                                      index='movieId',
                                      values='rating')
df = df - df.mean()
similarity = pd.DataFrame(cosine_similarity(
    df.T.fillna(0)), index=df.columns, columns=df.columns)
res = df.apply(lambda col: ' '.join('{}'.format(mid) for mid in (0 * col).fillna(
    predict(similarity[col.name].nlargest(6).iloc[1:])).nlargest(5).index))
# Do not understand why this does not work for me but works below
df = pd.DataFrame.from_items(zip(res.index, res.str.split(' ')))
# print(df)
df.columns = ['movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']
df['customer_id'] = df.index
df = df[['customer_id', 'movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']]
df.to_csv('filepath.txt', sep=' ', index=False)
I tried implementing @emmet02's solution but got this error, and I do not understand why:
ValueError: Length mismatch: Expected axis has 671 elements, new values have 5 elements
Any advice is appreciated, please let me know if you need any more information or clarification.
Answered by emmet02
I would suggest turning your pd.Series into a pd.DataFrame first.
df = pd.DataFrame.from_items(zip(series.index, series.str.split(' '))).T
So long as every entry in the Series has the same number of values, separated by a space, this will return a dataframe in this format:
Out[49]:
0 1 2 3 4
0 3072 648 457 1035 260
1 1196 475 150 7153 671
2 838 1 300 953 1210
3 2278 151 21 4993 2628
4 1259 1035 339 2571 7153
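DataFrame.from_items was deprecated in pandas 0.23 and removed in pandas 1.0, so the line above will not run on a current install. A minimal sketch of one possible substitute, assuming a Series shaped like the one in the question (the three sample rows and the name `series` are stand-ins), is to let `str.split` expand the pieces into columns directly:

```python
import pandas as pd

# Hypothetical stand-in for the Series in the question: index = userId,
# values = five space-separated movie ids.
series = pd.Series(
    ["3072 1196 838 2278 1259", "648 475 1 151 1035", "457 150 300 21 339"],
    index=pd.Index([1, 2, 3], name="userId"),
)

# expand=True spreads the split pieces over columns, one row per user.
df = series.str.split(" ", expand=True)
df.columns = ["movie-id{}".format(i + 1) for i in range(df.shape[1])]
print(df)
```

This keeps the userId as the index and gives one row per user, so no transpose is needed. The values are still strings at this point.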
Next I would name the columns appropriately:
df.columns = ['movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']
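Assigning five names only works when the frame has exactly five columns. The snippet in the question omits the trailing .T, which is a plausible (though unconfirmed) cause of the "Length mismatch" error quoted there. The sketch below reproduces that situation on three sample rows; building the frame from a dict is an assumption standing in for from_items, which is unavailable in pandas 1.0 and later.

```python
import pandas as pd

# Hypothetical three-user sample in the shape of the question's Series.
series = pd.Series(
    ["3072 1196 838 2278 1259", "648 475 1 151 1035", "457 150 300 21 339"],
    index=pd.Index([1, 2, 3], name="userId"),
)
names = ["movie-id1", "movie-id2", "movie-id3", "movie-id4", "movie-id5"]

# One column per user: 5 rows x 3 columns (the shape without the transpose).
wide = pd.DataFrame(dict(zip(series.index, series.str.split(" "))))
try:
    wide.columns = names  # 5 names for 3 columns
    mismatch = False
except ValueError as err:
    mismatch = True
    print(err)

# Transposed: one row per user, so the five names fit.
tall = wide.T
tall.columns = names
print(tall)
```

With the full data the untransposed frame would have 671 columns, one per user, which matches the numbers in the reported error.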
Finally, the dataframe is indexed by customer id (I am supposing this based upon your series index). We want to move that into the dataframe, and then reorganise the columns.
df['customer_id'] = df.index
df = df[['customer_id', 'movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']]
This now leaves you with a dataframe like this:
customer_id movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
0 0 3072 648 457 1035 260
1 1 1196 475 150 7153 671
2 2 838 1 300 953 1210
3 3 2278 151 21 4993 2628
4 4 1259 1035 339 2571 7153
which I would recommend you write to disk as a CSV using:
df.to_csv('filepath.csv', index=False)
If however you want to write it as a text file, with only spaces separating, you can use the same function but pass the separator.
df.to_csv('filepath.txt', sep=' ', index=False)
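If the target file should contain only the data rows, as the layout requested in the question suggests, `to_csv` also accepts `header=False`. A small sketch under that assumption, using a made-up two-row frame and a temporary path chosen only for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-user frame with a shortened set of columns.
df = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "movie-id1": [3072, 648],
        "movie-id2": [1196, 475],
    }
)

path = os.path.join(tempfile.mkdtemp(), "output.txt")  # hypothetical location
# header=False drops the column-name line, leaving only the data rows.
df.to_csv(path, sep=" ", index=False, header=False)

# Read the file back to check what was written.
with open(path) as f:
    lines = f.read().splitlines()
print(lines)
```

Reading the file back is only there to confirm the layout; in practice the `to_csv` call is all that is needed.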
I don't think that the Series object is the correct choice of data structure for the problem you want to solve. Treating numerical data as numerical data (and in a DataFrame) is far easier than maintaining 'space delimited string' conversions imo.
Answered by holypriest
You can use the following approach, splitting the items of your Series object (that I called s) into lists and converting a list of those lists into a DataFrame object (that I called df):
df = pd.DataFrame([[s.index[i]] + s.str.split(' ').iloc[i] for i in range(0, len(s))])
The [s.index[i]] + s.str.split(' ').iloc[i] part concatenates the index value to the front of each list of movie ids, and this is done for every row in the series. The lookup goes through .iloc so that it stays positional even when the index labels (here the userIds) do not start at 0.
After that, you could just dump the DataFrame to a .txt file using a space as separator:
df.to_csv('output.txt', sep=' ', index=False)
You could also name your columns before dumping it, as suggested earlier.
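As a rough sketch of that last point, the rows can be built by zipping the index with the split lists and the names passed straight to the constructor. The sample Series and the column names below are assumptions chosen to match the question:

```python
import pandas as pd

# Hypothetical two-user sample in the shape of the question's Series.
s = pd.Series(
    ["3072 1196 838 2278 1259", "648 475 1 151 1035"],
    index=pd.Index([1, 2], name="userId"),
)

columns = ["customer_id"] + ["movie-id{}".format(i) for i in range(1, 6)]
# Each row is the user id followed by that user's list of movie ids.
rows = [[user_id] + movies for user_id, movies in zip(s.index, s.str.split(" "))]
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Iterating with zip avoids indexing into the Series at all, so it does not matter what the index labels are.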
Answered by sgDysregulation
I suggest modifying the code as shown below:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


def predict(l):
    # finds the userIds corresponding to the top 5 similarities
    # calculate the prediction according to the formula
    return (df[l.index] * l).sum(axis=1) / l.sum()


# use userId as columns for convenience when interpreting the formula
df = pd.read_csv('ratings.csv').pivot(columns='userId',
                                      index='movieId',
                                      values='rating')
df = df - df.mean()
similarity = pd.DataFrame(cosine_similarity(
    df.T.fillna(0)), index=df.columns, columns=df.columns)
res = df.apply(lambda col: (0 * col).fillna(
    predict(similarity[col.name].nlargest(6).iloc[1:])
).nlargest(5).index.tolist()
).apply(pd.Series).rename(
    columns=lambda col_name: 'movie-id{}'.format(col_name + 1)).reset_index(
).rename(columns={'userId': 'customer_id'})
# write the result to a space-separated text file
res.to_csv('filepath.txt', sep=' ', index=False)
res.head()
In [2]: res.head()
Out[2]:
customer_id movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
0 1 3072 1196 838 2278 1259
1 2 648 475 1 151 1035
2 3 457 150 300 21 339
3 4 1035 7153 953 4993 2571
4 5 260 671 1210 2628 7153
Show the file:
In [3]: ! head -5 filepath.txt
customer_id movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
1 3072 1196 838 2278 1259
2 648 475 1 151 1035
3 457 150 300 21 339
4 1035 7153 953 4993 2571
Answered by matanster
It can also be worth skipping the CSV writer altogether, which is more or less required when the series holds text and you want to avoid escaping/quoting hell. Something like:
with open(filename, 'w') as f:
    for entry in df['target_column']:
        # write one entry per line; write() does not add a newline itself
        f.write('{}\n'.format(entry))
Of course you can add the series index yourself in the loop, if desired.
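A minimal sketch of that idea, assuming a Series like the one in the question (the sample values and the temporary path are made up for illustration), pairs each index label with its value via `Series.items()`:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-user sample in the shape of the question's Series.
s = pd.Series(
    ["3072 1196 838 2278 1259", "648 475 1 151 1035"],
    index=pd.Index([1, 2], name="userId"),
)

path = os.path.join(tempfile.mkdtemp(), "output.txt")  # hypothetical location
with open(path, "w") as f:
    # items() yields (index label, value) pairs, here (userId, movie ids).
    for user_id, movies in s.items():
        f.write("{} {}\n".format(user_id, movies))

# Read the file back to check what was written.
with open(path) as f:
    written = f.read()
print(written)
```

Because the values are already space-separated strings, no splitting or CSV quoting is involved.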