pandas read_csv: fix columns when reading data that contains newline characters

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/45453093/

Time: 2020-09-14 04:09:31  Source: igfitidea

Tags: python, regex, pandas

Asked by Vlad

I am using pandas to read in a large tab-delimited file:

df = pd.read_csv(file_path, sep='\t', encoding='latin 1', dtype=str, keep_default_na=False, na_values='')

The problem is that there are 200 columns and the 3rd column is text with occasional newline characters. The text is not delimited with any special characters. These lines get chopped into multiple lines with data going into the wrong columns.

There are a fixed number of tabs in each line - that is all I have to go on.
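
To make the symptom concrete, here is a small sketch on a made-up 4-column sample (the data and variable names are hypothetical, not taken from the real file). It only counts tabs per physical line, which shows how one record with an embedded newline turns into two short lines.

```python
# Hypothetical 4-column sample: the 3rd field of the first data record
# contains an embedded newline.
sample = (
    "id\tname\tnotes\tscore\n"
    "1\tann\tfirst line\nsecond line\t10\n"
    "2\tbob\tsingle line\t20\n"
)

# Splitting on newlines, as a line-oriented reader would, breaks the first
# data record into two physical lines that each have too few tabs.
physical_lines = sample.splitlines()
tabs_per_line = [line.count("\t") for line in physical_lines]
print(tabs_per_line)  # [3, 2, 1, 3]
```

The two broken pieces together still add up to the expected 3 tabs, which is the only structure available for stitching them back together.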

Answered by piRSquared

The idea is to use regex to find all instances of stuff separated by a given number of tabs and ending in a newline. Then take all that and create a dataframe.

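import re

import pandas as pd

def wonky_parser(fn):
    # Read the whole file so that the regex can match across line breaks.
    with open(fn) as f:
        txt = f.read()
    # A record is 8 tabs with non-tab text (newlines allowed) around them,
    # followed by a newline or the end of the file. {8} is where the number
    # of tabs is specified: 8 tabs means 9 columns, so the 200-column file
    # in the question would need {199}.
    preparse = re.findall(r'((?:[^\t]*\t[^\t]*){8})(?:\n|\Z)', txt)
    # Strip the trailing newline that the last record can keep, then split.
    parsed = [rec.rstrip('\n').split('\t') for rec in preparse]
    return pd.DataFrame(parsed)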

Pass a filename to the function and get your dataframe back.
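
As a rough illustration of the same idea on a small in-memory string, here is a sketch with the column count made a parameter. The `split_records` helper, its `n_cols` argument and the 4-column sample are made up for this example, and the first record is assumed to be a header row.

```python
import re


def split_records(txt, n_cols):
    """Split tab-delimited text into records of n_cols fields each.

    Fields may contain newlines; only the tab count marks a record.
    """
    n_tabs = n_cols - 1
    pattern = r'((?:[^\t]*\t[^\t]*){%d})(?:\n|\Z)' % n_tabs
    # The last record can keep its trailing newline, so strip it before
    # splitting each record into fields.
    return [rec.rstrip('\n').split('\t') for rec in re.findall(pattern, txt)]


# Hypothetical sample: 4 columns, and the 3rd field of the first data
# record contains an embedded newline.
sample = (
    "id\tname\tnotes\tscore\n"
    "1\tann\tfirst line\nsecond line\t10\n"
    "2\tbob\tsingle line\t20\n"
)

# Treat the first record as the header (an assumption about the file).
header, *rows = split_records(sample, n_cols=4)
print(header)
print(rows)
```

The resulting `header` and `rows` could then be handed to `pd.DataFrame(rows, columns=header)`.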

Answered by Prakash Palnati

Name your third column:

df = df.rename(columns={df.columns[2]: "some_name"})

and use converters to pass your function:

pd.read_csv("foo.csv", sep='\t', encoding='latin 1', dtype=str, keep_default_na=False, converters={'some_name': lambda x: x.replace('\n', '')})

You can use any function that performs the cleanup you need in place of the lambda.
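
One caveat worth checking before relying on this: converters receive each field only after pandas has already split the file into rows, so they can clean up a newline only if it survived parsing, for instance inside a quoted field. The sketch below assumes such a quoted sample (unlike the file in the question); `flatten_newlines` and the data are hypothetical. It also keys the converter by column position, which `read_csv` accepts, so the column does not have to be named first.

```python
import io

import pandas as pd

# Hypothetical sample in which the multi-line field IS quoted, unlike the
# file described in the question.
data = 'id\tname\tnotes\n1\tann\t"first line\nsecond line"\n2\tbob\tsingle line\n'


def flatten_newlines(value):
    # Replace embedded line breaks with a space; any other cleanup would work.
    return value.replace("\n", " ")


df = pd.read_csv(
    io.StringIO(data),
    sep="\t",
    keep_default_na=False,
    converters={2: flatten_newlines},  # key 2 = third column, by position
)
print(df["notes"].tolist())
```

If the newlines are not quoted, the rows are already broken by the time the converter runs, and the text has to be repaired before it reaches `read_csv`.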