pandas read_csv fix columns to read data with newline characters in data
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Original source: http://stackoverflow.com/questions/45453093/
Asked by Vlad
I am using pandas to read in a large tab-delimited file:
df = pd.read_csv(file_path, sep='\t', encoding='latin 1', dtype = str, keep_default_na=False, na_values='')
The problem is that there are 200 columns, and the 3rd column is text that occasionally contains newline characters. The text is not delimited with any special characters, so these rows get chopped into multiple lines, with data going into the wrong columns.
There is a fixed number of tabs in each row. That is all I have to go on.
Answered by piRSquared
The idea is to use a regex to find every run of text that contains a given number of tabs and ends in a newline (or at the end of the file), then build a dataframe from those matches.
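import re

import pandas as pd


def wonky_parser(fn):
    with open(fn) as f:
        txt = f.read()
    # The {8} is the number of tabs expected in each record
    # (number of columns minus one). Change it to match your file.
    preparse = re.findall(r'(([^\t]*\t[^\t]*){8}(\n|\Z))', txt)
    # t[0] is the whole matched record; its last field keeps the trailing newline.
    parsed = [t[0].split('\t') for t in preparse]
    return pd.DataFrame(parsed)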
Pass a filename to the function and get your dataframe back.
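As an alternative to the regex, the same "fixed number of tabs per row" idea can be sketched by gluing physical lines back together until the expected tab count is reached. This is only a minimal sketch, not part of the original answer: the function name `read_tsv_with_embedded_newlines` and the parameter `n_tabs` are hypothetical, and it assumes the stray newlines never occur in the last column (otherwise a row would look complete too early).

```python
import io

import pandas as pd


def read_tsv_with_embedded_newlines(lines, n_tabs):
    """Rebuild logical rows from physical lines (hypothetical helper).

    `lines` is any iterable of lines that still end in a newline, such as an
    open text file. `n_tabs` is the number of tabs in one complete row.
    """
    records = []
    buffer = ""
    for line in lines:
        buffer += line
        # A row is complete once it holds the expected number of tabs.
        if buffer.count("\t") >= n_tabs:
            records.append(buffer.rstrip("\n").split("\t"))
            buffer = ""
    if buffer.strip():
        # Keep a trailing incomplete row rather than dropping it silently.
        records.append(buffer.rstrip("\n").split("\t"))
    return pd.DataFrame(records)


# Small in-memory demo: 4 columns (3 tabs); the third field of the
# first row contains a newline.
sample = "a\tb\tc1\nc2\td\ne\tf\tg\th\n"
df = read_tsv_with_embedded_newlines(io.StringIO(sample), n_tabs=3)
print(df)
```

For a real file, something like `with open(file_path, encoding='latin-1') as f: df = read_tsv_with_embedded_newlines(f, n_tabs=199)` should work, where 199 assumes the file has exactly 200 columns. The embedded newline is kept inside the field; replace it afterwards if you do not want it.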
Answered by Prakash Palnati
Name your third column:
df.columns.values[2] = "some_name"
Then use converters to pass your function:
pd.read_csv("foo.csv", sep='\t', encoding='latin 1', dtype=str, keep_default_na=False, converters={'some_name': lambda x: x.replace('\n', '')})
You can put any transformation that works for you inside the lambda.
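A small self-contained sketch of how `converters` behaves may help here. The sample data and the column names (`id`, `kind`, `some_name`, `value`) are made up for illustration, and the sketch assumes the field containing the newline is quoted, so the parser keeps it inside a single row before the converter runs.

```python
import io

import pandas as pd

# Hypothetical sample: the third field of the first row is quoted,
# so its embedded newline stays inside one row.
sample = (
    'id\tkind\tsome_name\tvalue\n'
    '1\tx\t"first\nsecond"\t10\n'
    '2\ty\tplain\t20\n'
)

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    # The converter receives each raw string of the column and returns
    # the cleaned value; here the newline becomes a space.
    converters={"some_name": lambda x: x.replace("\n", " ")},
)
print(df)
```

The converter is applied per cell after the row has been split into fields, so it can clean up a value but cannot rejoin a row that the parser has already broken apart.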