Pandas - image to DataFrame
Original question: http://stackoverflow.com/questions/49649215/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by Terence Eden
I want to convert an RGB image into a DataFrame, so that I have the co-ordinates of each pixel and their RGB value.
   x  y  red  green  blue
0  0  0  154      0     0
1  1  0  149    111     0
2  2  0  153      0     5
3  0  1  154      0     9
4  1  1  154     10    10
5  2  1  154      0     0
I can extract the RGB values into a DataFrame quite easily:
from PIL import Image
import numpy as np
import pandas as pd

colourImg = Image.open("test.png")
colourPixels = colourImg.convert("RGB")
colourArray = np.array(colourPixels.getdata())
df = pd.DataFrame(colourArray, columns=["red","green","blue"])
But I don't know how to get the X and Y coordinates in there. I could write a loop, but on a large image that takes a long time.
Accepted answer by davidsheldon
Try using np.indices. Unfortunately, it ends up with an array where the coordinate is the first dimension, but you can use np.moveaxis to fix that:
from PIL import Image
import numpy as np
import pandas as pd

colourImg = Image.open("test.png")
colourPixels = colourImg.convert("RGB")
# PIL's size is (width, height) but getdata() is row-major, so reshape to (height, width, 3)
shape = colourImg.size[::-1]
colourArray = np.array(colourPixels.getdata()).reshape(shape + (3,))
indicesArray = np.moveaxis(np.indices(shape), 0, 2)
allArray = np.dstack((indicesArray, colourArray)).reshape((-1, 5))
df = pd.DataFrame(allArray, columns=["y", "x", "red","green","blue"])
It's not the prettiest, but it seems to work (edit: fixed x,y being the wrong way around).
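To see what np.indices and np.moveaxis are doing here, a minimal sketch on a tiny synthetic array may help. The 2x3 shape, the pixel values, and the names `pixels`, `grid`, `coords`, and `df_small` are made up for illustration and do not come from the question's test.png.

```python
import numpy as np
import pandas as pd

# Hypothetical 2-row by 3-column RGB image, shape (height, width, 3)
height, width = 2, 3
pixels = np.arange(height * width * 3).reshape(height, width, 3)

# np.indices returns shape (2, height, width): the coordinate axis comes first
grid = np.indices((height, width))

# Move the coordinate axis to the end so it lines up with the colour channels
coords = np.moveaxis(grid, 0, 2)  # shape (height, width, 2), each entry is (row, col)

# Stack coordinates and colours per pixel, then flatten to one row per pixel
table = np.dstack((coords, pixels)).reshape(-1, 5)
df_small = pd.DataFrame(table, columns=["y", "x", "red", "green", "blue"])
print(df_small)
```

Each row of df_small is one pixel, with x varying fastest, which matches the layout requested in the question.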
Answer by eugenhu
I've named the coordinates 'col' and 'row' to be explicit and avoid confusion about whether the x-coordinate refers to the column number or the row number of your original pixel array:
A = colourArray  # 3-D pixel array of shape (rows, cols, 3)

# Create the multiindex we'll need for the series
index = pd.MultiIndex.from_product(
    (*map(range, A.shape[:2]), ('r', 'g', 'b')),
    names=('row', 'col', None)
)

# Can be chained but separated for use in explanation
df = pd.Series(A.flatten(), index=index)
df = df.unstack()
df = df.reset_index().reindex(columns=['col', 'row', 'r', 'g', 'b'])
Explanation:
pd.Series(A.flatten(), index=index) will create a multiindex series where each channel intensity is accessible via df[row_n, col_n][channel_r_g_or_b]. The df variable (currently a series) will now look something like this:
row  col
0    0    r    116
          g     22
          b    220
     1    r     75
          g    134
          b     43
               ...
255  246  r     79
          g      9
          b    218
     247  r    225
          g    172
          b    172
unstack() will pivot the third index (channel index), returning a dataframe with columns b, g, r, with each row indexed by a multiindex of (row_n, col_n). The df now looks like this:
           b    g    r
row col
0   0    220   22  116
    1     43  134   75
    2    187   97   33
... ...  ...  ...  ...
255 226  156  242  128
    227  221   63  212
    228   75  110  193
We then call reset_index() to get rid of the (row_n, col_n) multiindex and just have a flat 0..(n_pixels-1) index. The df is now:
       row  col    b    g    r
0        0    0  220   22  116
1        0    1   43  134   75
2        0    2  187   97   33
...    ...  ...  ...  ...  ...
65506  255  226  156  242  128
65507  255  227  221   63  212
65508  255  228   75  110  193
And then a simple reindex() to rearrange the columns into col, row, r, g, b order.
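The steps above can be traced end to end on a small input. This is only a sketch: the 2x2 array `A` and the intermediate names `series`, `wide`, `flat`, and `result` are hypothetical stand-ins, chosen so that each stage is small enough to print and inspect.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 RGB array standing in for the image, shape (rows, cols, 3)
A = np.array([[[10, 20, 30], [40, 50, 60]],
              [[70, 80, 90], [100, 110, 120]]])

index = pd.MultiIndex.from_product(
    (range(A.shape[0]), range(A.shape[1]), ("r", "g", "b")),
    names=("row", "col", None),
)

series = pd.Series(A.flatten(), index=index)  # one entry per (row, col, channel)
wide = series.unstack()                       # channel level becomes the columns
flat = wide.reset_index()                     # (row, col) become ordinary columns
result = flat.reindex(columns=["col", "row", "r", "g", "b"])
print(result)
```

Printing `series`, `wide`, and `flat` in turn should show the same progression as the outputs listed in the explanation, just with four pixels instead of a full image.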
Timings:
Now as for how fast this runs, well... for a 3-channel image, here are the timings:
Size         Time
250x250      58.2 ms
500x500      251 ms
1000x1000    1.03 s
2500x2500    8.14 s
Admittedly not great on images > 1 MP. unstack() can take a while after the df gets very large.
I've tried @davidsheldon's solution and it ran a lot quicker: for a 2500x2500 image it took 244 ms, and a 10000x10000 image took 9.04 s.
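For anyone who wants to reproduce a comparison like this, here is a rough timing sketch. It assumes a random synthetic image rather than a real file, and the function names `with_indices` and `with_unstack` are made up for this example. Absolute numbers will vary with hardware and library versions, so no expected timings are given.

```python
import time
import numpy as np
import pandas as pd

def with_indices(arr):
    """np.indices approach; arr has shape (rows, cols, 3)."""
    coords = np.moveaxis(np.indices(arr.shape[:2]), 0, 2)
    stacked = np.dstack((coords, arr)).reshape(-1, 5)
    return pd.DataFrame(stacked, columns=["row", "col", "r", "g", "b"])

def with_unstack(arr):
    """MultiIndex/unstack approach; arr has shape (rows, cols, 3)."""
    index = pd.MultiIndex.from_product(
        (range(arr.shape[0]), range(arr.shape[1]), ("r", "g", "b")),
        names=("row", "col", None),
    )
    wide = pd.Series(arr.flatten(), index=index).unstack()
    return wide.reset_index().reindex(columns=["row", "col", "r", "g", "b"])

# Hypothetical random 250x250 image; change the size to test larger inputs
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(250, 250, 3))

for func in (with_indices, with_unstack):
    start = time.perf_counter()
    frame = func(image)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {elapsed * 1000:.1f} ms for {len(frame)} pixels")
```

A single run per function is noisy; repeating each call a few times and taking the best result would give steadier figures.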