pandas read_csv: fixing columns when the data contains newline characters

Question

Asked by Vlad

I am using pandas to read in a large tab-delimited file:

df = pd.read_csv(file_path, sep='\t', encoding='latin-1', dtype=str, keep_default_na=False, na_values='')

The problem is that there are 200 columns, and the 3rd column is free text that occasionally contains newline characters. The text is not quoted or wrapped in any special delimiter, so these records get chopped into multiple lines and the data ends up in the wrong columns.

There is a fixed number of tabs in each record (logical line); that is all I have to go on.

Answer 1

Answered by piRSquared

The idea is to use regex to find all instances of stuff separated by a given number of tabs and ending in a newline. Then take all that and create a dataframe.

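import re
import pandas as pd

def wonky_parser(fn, n_tabs=8, encoding='latin-1'):
    # Read the whole file into memory; the context manager closes it afterwards.
    with open(fn, encoding=encoding) as f:
        txt = f.read()
    # A record holds exactly n_tabs tabs and ends at a newline or at the end
    # of the file; the fields themselves may contain newlines.
    # This example uses 8 tabs; a file with 200 columns needs n_tabs=199.
    pattern = r'((?:[^\t]*\t[^\t]*){%d})(?:\n|\Z)' % n_tabs
    parsed = [record.split('\t') for record in re.findall(pattern, txt)]
    return pd.DataFrame(parsed)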

Pass a filename to the function and get your dataframe back.
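The same tab-counting idea can also be applied one physical line at a time, which avoids reading a large file into memory in one piece. The following is only a sketch of that variation, not the answer's own method: `join_broken_records`, the sample text, and its column names are made up for illustration, and it assumes that the stray newlines never occur in the last column.

```python
import io
import pandas as pd

def join_broken_records(lines, n_tabs):
    """Yield one list of fields per logical record.

    Physical lines are glued together until the buffer holds n_tabs tab
    characters. Assumption: embedded newlines never occur in the last
    column, otherwise a record would be closed too early.
    """
    buffer = ""
    for line in lines:
        buffer += line
        if buffer.count("\t") >= n_tabs:
            yield buffer.rstrip("\n").split("\t")
            buffer = ""
    if buffer.strip():
        # Leftover text with too few tabs: surface it instead of dropping it.
        yield buffer.rstrip("\n").split("\t")

# Hypothetical 4-column sample; the third column of the first data record
# contains an unquoted newline.
sample = (
    "id\tname\tnote\tscore\n"
    "1\tann\tfirst line\nsecond line\t10\n"
    "2\tbob\tplain\t20\n"
)

rows = list(join_broken_records(io.StringIO(sample), n_tabs=3))
# The first logical record is the header row.
df = pd.DataFrame(rows[1:], columns=rows[0])
```

For the file in the question, the iterable would be an open file handle (for example `open(file_path, encoding='latin-1')`) and `n_tabs` would be 199 for 200 columns.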

Answer 2

Answered by Prakash Palnati

Name your third column:

df = df.rename(columns={df.columns[2]: "some_name"})

Then use converters to pass your function:

pd.read_csv("foo.csv", sep='\t', encoding='latin-1', dtype=str, keep_default_na=False, converters={'some_name': lambda x: x.replace('\n', '')})

You can use any function that suits your data in place of the lambda.
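A minimal sketch of the converters mechanism follows. It assumes, unlike the file in the question, that the multi-line field is quoted, so the parser hands it to the converter as a single value; with unquoted newlines the parser splits the row before any converter runs. The sample text and the column name `note` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample: the third field of the first record is quoted,
# so read_csv keeps its embedded newline inside one cell.
sample = (
    'id\tname\tnote\tscore\n'
    '1\tann\t"first line\nsecond line"\t10\n'
    '2\tbob\tplain\t20\n'
)

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    # dtype is given per column and leaves out "note", because pandas
    # ignores dtype (with a warning) for a column that has a converter.
    dtype={"id": str, "name": str, "score": str},
    keep_default_na=False,
    # The converter receives each raw string of the "note" column.
    converters={"note": lambda x: x.replace("\n", " ")},
)
```

Converter keys can be column labels, as here, or integer positions such as `2`.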
