Python AnalysisException: u"cannot resolve 'name' given input columns: [list] in sqlContext in spark
Disclaimer: this page is a bilingual translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/39016440/
Asked by Elm662
I tried a simple example:
data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/databricks-datasets/samples/population-vs-price/data_geo.csv")
data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("2014 Population estimate", "2015 median sales price").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
It works well, but when I try something very similar:
data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load('/mnt/%s/OnlineNewsTrainingAndValidation.csv' % MOUNT_NAME)
data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("timedelta", "shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
display(data)
It raises an error: AnalysisException: u"cannot resolve 'timedelta' given input columns: [ data_channel_is_tech,...
Of course I imported LabeledPoint and LinearRegression.
What could be wrong?
Even the simpler case
df_cleaned = df_cleaned.select("shares")
raises the same AnalysisException.
*Please note: df_cleaned.printSchema() works well.
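This is exactly why the problem is hard to spot: printSchema() renders the names in a layout where surrounding whitespace is invisible. A quick diagnostic (a sketch, not from the original post; the `columns` list stands in for `df.columns` of a Spark DataFrame) is to print each name with repr(), which makes leading or trailing spaces explicit:

```python
# Stand-in for df.columns of a Spark DataFrame (assumed names for illustration).
columns = [" timedelta", " shares", "url"]

# repr() exposes whitespace that printSchema() output can hide.
for name in columns:
    print(repr(name))  # e.g. ' timedelta' -- the leading space is visible

# Flag any column whose name is not already stripped:
suspect = [name for name in columns if name != name.strip()]
print(suspect)
```

Any name that appears in `suspect` will fail to resolve when referenced by its trimmed spelling.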
Answered by Elm662
I found the issue: some of the column names contain whitespace before the name itself. So
data = data.select(" timedelta", " shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
worked. I could catch the whitespace using
assert " " not in ''.join(df.columns)
Now I am thinking of a way to remove the whitespace. Any idea is much appreciated!
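One way to remove the whitespace in bulk (a sketch, assuming a Spark DataFrame `df`; `strip_column_names` is a hypothetical helper, while `DataFrame.toDF(*names)` is the real PySpark method that returns a DataFrame with the given new column names):

```python
def strip_column_names(df):
    """Return a DataFrame whose column names have surrounding whitespace removed.

    Hypothetical helper: relies on PySpark's DataFrame.toDF(*names), which
    renames all columns positionally.
    """
    return df.toDF(*[c.strip() for c in df.columns])

# The renaming logic itself is plain Python and works on any list of names:
names = [" timedelta", " shares", "2014 Population estimate"]
stripped = [c.strip() for c in names]
print(stripped)
```

Note that `strip()` only trims the edges, so a legitimate internal space (as in "2014 Population estimate") is preserved; the `assert " " not in ''.join(df.columns)` check above would still flag such names.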
Answered by y durga prasad
The header contains spaces or tabs; remove them and try again.
1) My example script:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
df = spark.read.csv(r'test.csv', header=True, sep='^')
print("#################################################################")
df.printSchema()
df.createOrReplaceTempView("test")
re = spark.sql("select max_seq from test")
re.show()
print("################################################################")
2) Input file; here 'max_seq ' contains a trailing space, so we get the exception below:
Trx_ID^max_seq ^Trx_Type^Trx_Record_Type^Trx_Date
Traceback (most recent call last):
File "D:/spark-2.1.0-bin-hadoop2.7/bin/test.py", line 14, in <module>
re=spark.sql("select max_seq from test")
File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\session.py", line 541, in sql
File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u"cannot resolve '`max_seq`' given input columns: [Venue_City_Name, Trx_Type, Trx_Booking_Status_Committed, Payment_Reference1, Trx_Date, max_seq , Event_ItemVariable_Name, Amount_CurrentPrice, cinema_screen_count, Payment_IsMyPayment, r
3) Remove the space after the 'max_seq' column and it works fine:
Trx_ID^max_seq^Trx_Type^Trx_Record_Type^Trx_Date
17/03/20 12:16:25 INFO DAGScheduler: Job 3 finished: showString at <unknown>:0, took 0.047602 s
17/03/20 12:16:25 INFO CodeGenerator: Code generated in 8.494073 ms
max_seq
10
23
22
22
only showing top 20 rows
##############################################################
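Instead of editing the file by hand, the same cleanup can be applied to the header row programmatically (a sketch; the header string below is the one from the example above):

```python
# The problematic header from the example: 'max_seq ' carries a trailing space.
header = "Trx_ID^max_seq ^Trx_Type^Trx_Record_Type^Trx_Date"

# Split on the '^' separator and strip each field, mirroring the manual fix.
fields = [f.strip() for f in header.split("^")]
print("^".join(fields))
```

In Spark itself the equivalent move is to rename the columns after the read (e.g. via toDF or withColumnRenamed) rather than rewriting the source file.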
Answered by PPK
As there were tabs in my input file, removing the tabs or spaces from the header made the query work.
My example:
saledf = spark.read.csv("SalesLTProduct.txt", header=True, inferSchema= True, sep='\t')
saledf.printSchema()
root
|-- ProductID: string (nullable = true)
|-- Name: string (nullable = true)
|-- ProductNumber: string (nullable = true)
saledf.describe('ProductNumber').show()
+-------+-------------+
|summary|ProductNumber|
+-------+-------------+
| count| 295|
| mean| null|
| stddev| null|
| min| BB-7421|
| max| WB-H098|
+-------+-------------+
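If renaming the columns is not an option, Spark SQL can still address a whitespace-laden column by quoting its exact name in backticks (a sketch, not from the answers above; it assumes a temp view "test" whose column is literally "max_seq " with a trailing space):

```python
# The exact, unstripped column name as it appears in the view (assumed).
column = "max_seq "

# In Spark SQL, backticks quote an identifier verbatim, whitespace included:
query = "select `%s` from test" % column
print(query)

# The DataFrame API can likewise use the exact string: df["max_seq "]
```

This only works if the query uses the name character-for-character, which is why stripping the names once at read time is usually the more robust fix.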