Read specific columns with pandas or other python module
Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/26063231/
Asked by Daniel Thaagaard Andreasen
I have a csv file from this webpage. I want to read some of the columns in the downloaded file (the csv version can be downloaded in the upper right corner).
Let's say I want 2 columns:
- 59, which in the header is star_name
- 60, which in the header is ra
However, for some reason the authors of the webpage sometimes decide to move the columns around.
In the end I want something like this, keeping in mind that values can be missing.
data = #read data in a clever way
names = data['star_name']
ras = data['ra']
This will prevent my program from malfunctioning when the columns are moved again in the future, as long as the column names stay the same.
So far I have tried various approaches using the csv module and, recently, the pandas module, both without any luck.
EDIT (added two lines + the header of my datafile. Sorry, but it's extremely long.)
# name, mass, mass_error_min, mass_error_max, radius, radius_error_min, radius_error_max, orbital_period, orbital_period_err_min, orbital_period_err_max, semi_major_axis, semi_major_axis_error_min, semi_major_axis_error_max, eccentricity, eccentricity_error_min, eccentricity_error_max, angular_distance, inclination, inclination_error_min, inclination_error_max, tzero_tr, tzero_tr_error_min, tzero_tr_error_max, tzero_tr_sec, tzero_tr_sec_error_min, tzero_tr_sec_error_max, lambda_angle, lambda_angle_error_min, lambda_angle_error_max, impact_parameter, impact_parameter_error_min, impact_parameter_error_max, tzero_vr, tzero_vr_error_min, tzero_vr_error_max, K, K_error_min, K_error_max, temp_calculated, temp_measured, hot_point_lon, albedo, albedo_error_min, albedo_error_max, log_g, publication_status, discovered, updated, omega, omega_error_min, omega_error_max, tperi, tperi_error_min, tperi_error_max, detection_type, mass_detection_type, radius_detection_type, alternate_names, molecules, star_name, ra, dec, mag_v, mag_i, mag_j, mag_h, mag_k, star_distance, star_metallicity, star_mass, star_radius, star_sp_type, star_age, star_teff, star_detected_disc, star_magnetic_field
11 Com b,19.4,1.5,1.5,,,,326.03,0.32,0.32,1.29,0.05,0.05,0.231,0.005,0.005,0.011664,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2008,2011-12-23,94.8,1.5,1.5,2452899.6,1.6,1.6,Radial Velocity,,,,,11 Com,185.1791667,17.7927778,4.74,,,,,110.6,-0.35,2.7,19.0,G8 III,,4742.0,,
11 UMi b,10.5,2.47,2.47,,,,516.22,3.25,3.25,1.54,0.07,0.07,0.08,0.03,0.03,0.012887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2009,2009-08-13,117.63,21.06,21.06,2452861.05,2.06,2.06,Radial Velocity,,,,,11 UMi,229.275,71.8238889,5.02,,,,,119.5,0.04,1.8,24.08,K4III,1.56,4340.0,,
Accepted answer by Daniel Thaagaard Andreasen
An easy way to do this is to use the pandas library, like this.
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)
The missing piece was skipinitialspace, which removes the space after each delimiter, including those in the header. So ' star_name' becomes 'star_name'.
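To illustrate the effect of skipinitialspace, here is a minimal sketch. The `sample` string is a hypothetical, shortened stand-in for the real file (only four of its columns, with one star name left blank), so treat it as an assumption rather than the actual data.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: the header has a space after each comma,
# and the second row has no star_name.
sample = (
    "# name, mass, star_name, ra\n"
    "11 Com b,19.4,11 Com,185.1791667\n"
    "11 UMi b,10.5,,229.275\n"
)

# Without skipinitialspace the header names keep their leading space,
# so 'star_name' would not be found.
raw = pd.read_csv(io.StringIO(sample))
print(list(raw.columns))

# With skipinitialspace the names are clean and usecols can match them,
# wherever the columns sit in the file.
df = pd.read_csv(
    io.StringIO(sample),
    skipinitialspace=True,
    usecols=["star_name", "ra"],
)
names = df["star_name"]  # the missing value is read as NaN
ras = df["ra"]
print(df)
```

Because the columns are selected by name, the column positions (59 and 60 in the question) never appear in the code.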
Answer by frp farhan
I solved the problem above in a different way: I read the entire csv file, then tweak the display step to show only the desired columns.
import pandas as pd
df = pd.read_csv('data.csv', skipinitialspace=True)
print(df[['star_name', 'ra']])
This can help in some scenarios, such as learning the basics and filtering data by column in a DataFrame.
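As a rough sketch of why selecting by name after a full read copes with reordered columns, the example below loads two hypothetical layouts of the same data. The `old_layout` and `new_layout` strings and the `load_star_columns` helper are invented for illustration and are not part of the original answer.

```python
import io
import pandas as pd

# Two hypothetical versions of the same file with the columns in a different order.
old_layout = (
    "# name, star_name, ra\n"
    "11 Com b,11 Com,185.1791667\n"
    "11 UMi b,11 UMi,229.275\n"
)
new_layout = (
    "ra, # name, star_name\n"
    "185.1791667,11 Com b,11 Com\n"
    "229.275,11 UMi b,11 UMi\n"
)

def load_star_columns(text):
    """Read every column, then keep only the two of interest, by name."""
    df = pd.read_csv(io.StringIO(text), skipinitialspace=True)
    return df[["star_name", "ra"]]

a = load_star_columns(old_layout)
b = load_star_columns(new_layout)
print(a)
print(b)
```

The trade-off compared with usecols is that the whole file is parsed into memory before the unwanted columns are dropped.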
Answer by decision_scientist_noah
According to the latest pandas documentation you can read a csv file selecting only the columns which you want to read.
import pandas as pd
df = pd.read_csv('some_data.csv', usecols = ['col1','col2'], low_memory = False)
Here we use usecols, which makes read_csv load only the selected columns into the DataFrame.
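We set low_memory to False to turn off the default behaviour, in which pandas internally processes the file in chunks. Chunked parsing uses less memory but can produce mixed-type columns; with low_memory=False each column's type is inferred from the whole file at once.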
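The pandas documentation also mentions declaring column types with dtype as a way to avoid mixed-type inference. A minimal sketch of that option follows; the `sample` string and the chosen types are assumptions made for illustration, not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical three-column excerpt; the real file has many more columns.
sample = (
    "star_name, ra, dec\n"
    "11 Com,185.1791667,17.7927778\n"
    "11 UMi,229.275,71.8238889\n"
)

# Read only two columns and state their types up front, so pandas does not
# have to infer them while parsing.
df = pd.read_csv(
    io.StringIO(sample),
    skipinitialspace=True,
    usecols=["star_name", "ra"],
    dtype={"star_name": str, "ra": float},
)
print(df.dtypes)
```

Empty cells in a float column still come back as NaN, so this works with the missing values mentioned in the question.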

