Python: Programmatically convert a pandas DataFrame to a Markdown table

Question

Asked by OleVik

I have a Pandas Dataframe generated from a database, which has data with mixed encodings. For example:

+----+-------------------------+----------+------------+------------------------------------------------+--------------------------------------------------------+--------------+-----------------------+
| ID | path                    | language | date       | longest_sentence                               | shortest_sentence                                      | number_words | readability_consensus |
+----+-------------------------+----------+------------+------------------------------------------------+--------------------------------------------------------+--------------+-----------------------+
| 0  | data/Eng/Sagitarius.txt | Eng      | 2015-09-17 | With administrative experience in the prepa... | I am able to relocate internationally on short not...  | 306          | 11th and 12th grade   |
+----+-------------------------+----------+------------+------------------------------------------------+--------------------------------------------------------+--------------+-----------------------+
| 31 | data/Nor/Høylandet.txt  | Nor      | 2015-07-22 | Høgskolen i Østfold er et eksempel...          | Som skuespiller har jeg både...                        | 253          | 15th and 16th grade   |
+----+-------------------------+----------+------------+------------------------------------------------+--------------------------------------------------------+--------------+-----------------------+

As shown, the data is a mix of English and Norwegian (encoded as ISO-8859-1 in the database, I think). I need to output the contents of this DataFrame as a Markdown table without running into encoding problems. I followed this answer (from the question Generate Markdown tables?) and got the following:

import sys, sqlite3
import pandas as pd

db = sqlite3.connect("Applications.db")
df = pd.read_sql_query("SELECT path, language, date, longest_sentence, shortest_sentence, number_words, readability_consensus FROM applications ORDER BY date(date) DESC", db)
db.close()

rows = []
for index, row in df.iterrows():
    items = (row['date'], 
             row['path'], 
             row['language'], 
             row['shortest_sentence'],
             row['longest_sentence'], 
             row['number_words'], 
             row['readability_consensus'])
    rows.append(items)

headings = ['Date', 
            'Path', 
            'Language',
            'Shortest Sentence', 
            'Longest Sentence since', 
            'Words',
            'Grade level']

fields = [0, 1, 2, 3, 4, 5, 6]
align = [('^', '<'), ('^', '^'), ('^', '<'), ('^', '^'), ('^', '>'),
         ('^','^'), ('^','^')]

# table() is the helper function from the linked answer; it is not defined here
table(sys.stdout, rows, fields, headings, align)

However, this yields a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 72: ordinal not in range(128) error. How can I output the DataFrame as a Markdown table? That is, for the purpose of storing the resulting Markdown in a file for use in writing a Markdown document. I need the output to look like this:

| ID | path                    | language | date       | longest_sentence                               | shortest_sentence                                      | number_words | readability_consensus |
|----|-------------------------|----------|------------|------------------------------------------------|--------------------------------------------------------|--------------|-----------------------|
| 0  | data/Eng/Sagitarius.txt | Eng      | 2015-09-17 | With administrative experience in the prepa... | I am able to relocate internationally on short not...  | 306          | 11th and 12th grade   |
| 31 | data/Nor/Høylandet.txt  | Nor      | 2015-07-22 | Høgskolen i Østfold er et eksempel...          | Som skuespiller har jeg både...                        | 253          | 15th and 16th grade   |

Answer 1

Accepted answer by OleVik

Right, so I took a leaf from a question suggested by Rohit (Python - Encoding string - Swedish Letters), extended his answer, and came up with the following:

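# Enforce UTF-8 encoding (Python 2 only: this relies on the builtin reload()
# and sys.setdefaultencoding(), which Python 3 does not provide)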
import sys
stdin, stdout = sys.stdin, sys.stdout
reload(sys)
sys.stdin, sys.stdout = stdin, stdout
sys.setdefaultencoding('UTF-8')

# SQLite3 database
import sqlite3
# Pandas: Data structures and data analysis tools
import pandas as pd

# Read database, attach as Pandas dataframe
db = sqlite3.connect("Applications.db")
df = pd.read_sql_query("SELECT path, language, date, shortest_sentence, longest_sentence, number_words, readability_consensus FROM applications ORDER BY date(date) DESC", db)
db.close()
df.columns = ['Path', 'Language', 'Date', 'Shortest Sentence', 'Longest Sentence', 'Words', 'Readability Consensus']

# Parse Dataframe and apply Markdown, then save as 'table.md'
cols = df.columns
df2 = pd.DataFrame([['---','---','---','---','---','---','---']], columns=cols)
df3 = pd.concat([df2, df])
df3.to_csv("table.md", sep="|", index=False)

An important prerequisite is that the shortest_sentence and longest_sentence columns do not contain unnecessary line breaks; I removed them by applying .replace('\n', ' ').replace('\r', '') to the values before inserting them into the SQLite database. It appears that the solution is not to enforce the language-specific encoding (ISO-8859-1 for Norwegian), but rather to use UTF-8 instead of the default ASCII.
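
The line-break clean-up described above can be sketched as a small helper. This is a minimal illustration rather than the code actually used: the function name flatten_line_breaks and the sample sentence are hypothetical, and only the two .replace() calls come from the answer.

```python
def flatten_line_breaks(text):
    """Replace newlines with spaces and drop carriage returns."""
    # Same two calls as in the answer: '\n' becomes a space, '\r' is removed,
    # so a Windows-style '\r\n' collapses to a single space.
    return text.replace('\n', ' ').replace('\r', '')


sentence = "I am able to relocate\r\non short notice."
print(flatten_line_breaks(sentence))
```

Keeping each cell on one line matters because a pipe-delimited Markdown table expects exactly one table row per text line.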

I ran this through my IPython notebook (Python 2.7.10) and got a table like the following (fixed spacing for appearance here):

| Path                    | Language | Date       | Shortest Sentence                                                                            | Longest Sentence                                                                                                                                                                                                                                         | Words | Readability Consensus |
|-------------------------|----------|------------|----------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|-----------------------|
| data/Eng/Something1.txt | Eng      | 2015-09-17 | I am able to relocate to London on short notice.                                             | With my administrative experience in the preparation of the structure and content of seminars in various courses, and critiquing academic papers on various levels, I am confident that I can execute the work required as an editorial assistant.       | 306   | 11th and 12th grade   |
| data/Nor/NoeNorrønt.txt | Nor      | 2015-09-17 | Jeg har grundig kjennskap til Microsoft Office og Adobe.                                     | I løpet av studiene har jeg vært salgsmedarbeider for et større konsern, hvor jeg solgte forsikring til studentene og de faglige ansatte ved universitetet i Trønderlag, samt renholdsarbeider i et annet, hvor jeg i en periode var avdelingsansvarlig. | 205   | 18th and 19th grade   |
| data/Nor/?rret.txt.txt  | Nor      | 2015-09-17 | Jeg håper på positiv tilbakemelding, og møter naturligvis til intervju hvis det er ønskelig. | I løpet av studiene har jeg vært salgsmedarbeider for et større konsern, hvor jeg solgte forsikring til studentene og de faglige ansatte ved universitetet i Trønderlag, samt renholdsarbeider i et annet, hvor jeg i en periode var avdelingsansvarlig. | 160   | 18th and 19th grade   |

Thus, a Markdown table without problems with encoding.

Answer 2

Answered by Rohit

Try this out. I got it to work.

See the screenshot of my markdown file converted to HTML at the end of this answer.

import pandas as pd

# You don't need these two lines
# as you already have your DataFrame in memory
df = pd.read_csv("nor.txt", sep="|")
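# drop() returns a new DataFrame, so assign the result; this removes the
# last column (e.g. an empty one produced by a trailing "|" on each line)
df = df.drop(df.columns[-1], axis=1)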

# Get column names
cols = df.columns

# Create a new DataFrame with just the markdown
# strings
df2 = pd.DataFrame([['---',]*len(cols)], columns=cols)

#Create a new concatenated DataFrame
df3 = pd.concat([df2, df])

#Save as markdown
df3.to_csv("nor.md", sep="|", index=False)

Answer 3

Answered by kpykc

Improving the answer further, for use in IPython Notebook:

import pandas as pd

def pandas_df_to_markdown_table(df):
    from IPython.display import Markdown, display
    fmt = ['---' for i in range(len(df.columns))]
    df_fmt = pd.DataFrame([fmt], columns=df.columns)
    df_formatted = pd.concat([df_fmt, df])
    display(Markdown(df_formatted.to_csv(sep="|", index=False)))

pandas_df_to_markdown_table(infodf)

Or use tabulate:

pip install tabulate

Examples of use are in the documentation.

Answer 4

Answered by Alastair McCormack

sqlite3 returns Unicode strings by default for TEXT fields. Everything was set up to work before you introduced the table() function from an external source (which you did not provide in your question).

The table() function has str() calls which do not specify an encoding, so ASCII is used to protect you.

You need to rewrite table() not to do this, especially as you've got Unicode objects. You may have some success by simply replacing str() with unicode().
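
As an illustration of the same idea on Python 3, where str is already Unicode and unicode() no longer exists, the sketch below formats every cell as text and writes the file with an explicit UTF-8 encoding. It is only an assumption-laden example: write_table and the sample row are hypothetical and are not the table() function from the question.

```python
import os
import tempfile


def write_table(path, headings, rows):
    """Write a pipe-delimited Markdown table to `path`, encoded as UTF-8."""
    # Naming the encoding explicitly avoids depending on the platform default.
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(' | '.join(headings) + '\n')
        handle.write(' | '.join('---' for _ in headings) + '\n')
        for row in rows:
            # str() is Unicode-aware on Python 3, so non-ASCII text is safe here.
            handle.write(' | '.join(str(cell) for cell in row) + '\n')


# Hypothetical sample data, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'table.md')
write_table(path, ['Path', 'Words'], [['data/Nor/Høylandet.txt', 253]])

with open(path, encoding='utf-8') as handle:
    print(handle.read())
```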

Answer 5

Answered by Daniel Himmelstein

Export a DataFrame to markdown

I created the following function for exporting a pandas.DataFrame to markdown in Python:

def df_to_markdown(df, float_format='%.2g'):
    """
    Export a pandas.DataFrame to markdown-formatted text.
    DataFrame should not contain any `|` characters.
    """
    from os import linesep
    return linesep.join([
        '|'.join(df.columns),
        '|'.join(4 * '-' for i in df.columns),
        df.to_csv(sep='|', index=False, header=False, float_format=float_format)
    ]).replace('|', ' | ')

This function may not automatically fix the encoding issues of the OP, but that is a different issue than converting from pandas to markdown.
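
The docstring's restriction on `|` characters could be relaxed by escaping each cell before joining. The following is a rough sketch in plain Python, not part of the function above: escape_cell and rows_to_markdown are hypothetical names, and the backslash escape assumes a renderer that follows the GitHub Flavored Markdown table rules.

```python
def escape_cell(value):
    """Return `value` as text that is safe inside a pipe-table cell."""
    text = str(value)
    # A bare pipe would otherwise be read as a column separator.
    return text.replace('|', '\\|')


def rows_to_markdown(headers, rows):
    """Build a pipe table from a list of headers and a list of row lists."""
    lines = [
        ' | '.join(escape_cell(h) for h in headers),
        ' | '.join('---' for _ in headers),
    ]
    for row in rows:
        lines.append(' | '.join(escape_cell(cell) for cell in row))
    return '\n'.join(lines)


print(rows_to_markdown(['name', 'pattern'], [['either', 'a|b']]))
```

With a DataFrame, the inputs would presumably be list(df.columns) and df.values.tolist().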

Answer 6

Answered by dubbbdan

I have tried several of the above solutions in this post and found this worked most consistently.

To convert a pandas data frame to a markdown table I suggest using pytablewriter. Using the data provided in this post:

import pandas as pd
import pytablewriter
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

c = StringIO("""ID, path,language, date,longest_sentence, shortest_sentence, number_words , readability_consensus 
0, data/Eng/Sagitarius.txt , Eng, 2015-09-17 , With administrative experience in the prepa... , I am able to relocate internationally on short not..., 306, 11th and 12th grade
31 , data/Nor/Høylandet.txt  , Nor, 2015-07-22 , Høgskolen i Østfold er et eksempel..., Som skuespiller har jeg både..., 253, 15th and 16th grade
""")
df = pd.read_csv(c,sep=',',index_col=['ID'])

writer = pytablewriter.MarkdownTableWriter()
writer.table_name = "example_table"
writer.header_list = list(df.columns.values)
writer.value_matrix = df.values.tolist()
writer.write_table()

This results in:

# example_table
ID |           path           |language|    date    |                longest_sentence                |                   shortest_sentence                  | number_words | readability_consensus 
--:|--------------------------|--------|------------|------------------------------------------------|------------------------------------------------------|-------------:|-----------------------
  0| data/Eng/Sagitarius.txt  | Eng    | 2015-09-17 | With administrative experience in the prepa... | I am able to relocate internationally on short not...|           306| 11th and 12th grade   
 31| data/Nor/Høylandet.txt  | Nor    | 2015-07-22 | Høgskolen i Østfold er et eksempel...        | Som skuespiller har jeg både...                      |           253| 15th and 16th grade

Here is a markdown rendered screenshot.

Answer 7

Answered by Gustavo Bezerra

Here's an example function using pytablewriter and some regular expressions to make the markdown table look more like a dataframe does in Jupyter (with the row headers in bold).

import io
import re
import pandas as pd
import pytablewriter

def df_to_markdown(df):
    """
    Converts Pandas DataFrame to markdown table,
    making the index bold (as in Jupyter) unless it's a
    pd.RangeIndex, in which case the index is completely dropped.
    Returns a string containing markdown table.
    """
    isRangeIndex = isinstance(df.index, pd.RangeIndex)
    if not isRangeIndex:
        df = df.reset_index()
    writer = pytablewriter.MarkdownTableWriter()
    writer.stream = io.StringIO()
    writer.header_list = df.columns
    writer.value_matrix = df.values
    writer.write_table()
    writer.stream.seek(0)
    table = writer.stream.readlines()

    if isRangeIndex:
        return ''.join(table)
    else:
        # Make the indexes bold
        new_table = table[:2]
        for line in table[2:]:
            # Wrap the first cell (the index value) in ** to make it bold
            new_table.append(re.sub(r'^(.*?)\|', r'**\1**|', line))

        return ''.join(new_table)

Answer 8

Answered by Sebastian Jylanki

I recommend the python-tabulate library for generating ASCII tables. The library supports pandas.DataFrame as well.

Here is how to use it:

from pandas import DataFrame
from tabulate import tabulate

df = DataFrame({
    "weekday": ["monday", "thursday", "wednesday"],
    "temperature": [20, 30, 25],
    "precipitation": [100, 200, 150],
}).set_index("weekday")

print(tabulate(df, tablefmt="pipe", headers="keys"))

Output:

| weekday   |   temperature |   precipitation |
|:----------|--------------:|----------------:|
| monday    |            20 |             100 |
| thursday  |            30 |             200 |
| wednesday |            25 |             150 |
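
In the second line of this output, `:---` marks a left-aligned column and `---:` a right-aligned one; here the text column is left-aligned and the numeric columns are right-aligned. The sketch below reproduces that convention in plain Python. It is not how tabulate is implemented, and alignment_row is a hypothetical helper that looks at a single sample row.

```python
from numbers import Number


def alignment_row(sample_row):
    """Return a pipe-table alignment row based on the types in one data row."""
    markers = []
    for cell in sample_row:
        # Numbers are right-aligned ('---:'), everything else left-aligned (':---').
        markers.append('---:' if isinstance(cell, Number) else ':---')
    return '|' + '|'.join(markers) + '|'


print(alignment_row(['monday', 20, 100]))
```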

Answer 9

Answered by Ilya Prokin

Using the external tool pandoc and a pipe:

def to_markdown(df):
    from subprocess import Popen, PIPE
    s = df.to_latex()
    p = Popen('pandoc -f latex -t markdown',
              stdin=PIPE, stdout=PIPE, shell=True)
    stdoutdata, _ = p.communicate(input=s.encode("utf-8"))
    return stdoutdata.decode("utf-8")

Answer 10

Answered by Anake

For those looking for how to do this using tabulate, I thought I'd put this here to save you some time:

print(tabulate(df, tablefmt="pipe", headers="keys", showindex=False))
