pyspark: rdd2dataframe踩过的坑

作者: 张虾米试错 | 来源:发表于2019-12-03 00:02 被阅读0次

pyspark: rdd2dataframe踩过的坑
交互设计师所要避免的几个坑
vue踩过的坑
D1094:踩坑的价值最大化
投资避坑指南
PHP中的数据类型
踩过的坑
踩过的坑
踩过的坑
踩过的坑

大纲

主要记录在rdd2dataframe遇到的问题：

Input row doesn't have expected number of values required by the schema
Some of types cannot be determined by the first 100 rows, please try again with sampling

用toDF()的方式转换

from pyspark.sql import Row
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', age=10, height=80),Row(name='Didi', age=12, height=75)])
df = rdd.toDF()
df.show()
"""
+---+------+-----+                                                              
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
|  5|    80|  Bob|
| 10|    80| Cycy|
| 10|    80| Cycy|
| 12|    75| Didi|
+---+------+-----+
"""

1. Input row doesn't have expected number of values required by the schema

问题：某些行的某些字段缺失

rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', height=80),Row(name='Didi', age=12)])
df = rdd.toDF()
df.show()
### error: Input row doesn't have expected number of values required by the schema. 3 fields are required while 2 values are provided.

2. Some of types cannot be determined by the first 100 rows, please try again with sampling

问题：某个字段的类型不一致，导致spark无法识别

rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', age=10, height=80),Row(name=-1, age=12.0, height=75.5)])
df = rdd.toDF()
df.show()
"""
+----+------+-----+                                                             
| age|height| name|
+----+------+-----+
|   5|    80|Alice|
|   5|    80|  Bob|
|  10|    80| Cycy|
|  10|    80| Cycy|
|null|  null|   -1|
+----+------+-----+
"""
rdd = sc.parallelize([Row(name=-1, age=10, height=80),Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name=-1, age=10, height=80),Row(name=-1, age=10, height=80),Row(name='EaEa', age=5, height=80)])
df = rdd.toDF()
df.show()
"""
+---+------+----+
|age|height|name|
+---+------+----+
| 10|    80|  -1|
|  5|    80|null|
|  5|    80|null|
| 10|    80|  -1|
| 10|    80|  -1|
|  5|    80|null|
+---+------+----+
"""

当某个字段同时存在字符串型和数值型时，以第一个为准，若第一个字段为字符串型那么后面数值型可以正常显示；

扩展问题

同一个字段类型一定要一致，否则toDF()之后会被当作null处理！！！当不指定字段类型时，字段类型会遵照最开始的值。

rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', age=10, height=80),Row(name='Didi', age=12.0, height=75.5)])
df = rdd.toDF()
df.show()
"""
+----+------+-----+                                                             
| age|height| name|
+----+------+-----+
|   5|    80|Alice|
|   5|    80|  Bob|
|  10|    80| Cycy|
|  10|    80| Cycy|
|null|  null| Didi|
+----+------+-----+
"""
rdd = sc.parallelize([Row(name='Alice', age=5, height=80.0),Row(name='Bob', age=5.0, height=80),Row(name='Cycy', age=10.0, height=80),Row(name='Cycy', age=10.0, height=80),Row(name='Didi', age=12.0, height=75)])
_df = rdd.toDF()
_df.show()
"""
+----+------+-----+
| age|height| name|
+----+------+-----+
|   5|  80.0|Alice|
|null|  null|  Bob|
|null|  null| Cycy|
|null|  null| Cycy|
|null|  null| Didi|
+----+------+-----+
"""