大纲
主要记录在rdd2dataframe遇到的问题:
- Input row doesn't have expected number of values required by the schema
- Some of types cannot be determined by the first 100 rows, please try again with sampling
用toDF()的方式转换
from pyspark.sql import Row
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', age=10, height=80),Row(name='Didi', age=12, height=75)])
df = rdd.toDF()
df.show()
"""
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
| 5| 80| Bob|
| 10| 80| Cycy|
| 10| 80| Cycy|
| 12| 75| Didi|
+---+------+-----+
"""
1. Input row doesn't have expected number of values required by the schema
问题:某些行的某些字段缺失
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', height=80),Row(name='Didi', age=12)])
df = rdd.toDF()
df.show()
### error: Input row doesn't have expected number of values required by the schema. 3 fields are required while 2 values are provided.
2. Some of types cannot be determined by the first 100 rows, please try again with sampling
问题:某个字段的类型不一致,导致spark无法识别
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', age=10, height=80),Row(name=-1, age=12.0, height=75.5)])
df = rdd.toDF()
df.show()
"""
+----+------+-----+
| age|height| name|
+----+------+-----+
| 5| 80|Alice|
| 5| 80| Bob|
| 10| 80| Cycy|
| 10| 80| Cycy|
|null| null| -1|
+----+------+-----+
"""
rdd = sc.parallelize([Row(name=-1, age=10, height=80),Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name=-1, age=10, height=80),Row(name=-1, age=10, height=80),Row(name='EaEa', age=5, height=80)])
df = rdd.toDF()
df.show()
"""
+---+------+----+
|age|height|name|
+---+------+----+
| 10| 80| -1|
| 5| 80|null|
| 5| 80|null|
| 10| 80| -1|
| 10| 80| -1|
| 5| 80|null|
+---+------+----+
"""
当某个字段同时存在字符串型和数值型时,以第一个为准,若第一个字段为字符串型那么后面数值型可以正常显示;
扩展问题
同一个字段类型一定要一致,否则toDF()之后会被当作null处理!!!当不指定字段类型时,字段类型会遵照最开始的值。
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', age=10, height=80),Row(name='Didi', age=12.0, height=75.5)])
df = rdd.toDF()
df.show()
"""
+----+------+-----+
| age|height| name|
+----+------+-----+
| 5| 80|Alice|
| 5| 80| Bob|
| 10| 80| Cycy|
| 10| 80| Cycy|
|null| null| Didi|
+----+------+-----+
"""
rdd = sc.parallelize([Row(name='Alice', age=5, height=80.0),Row(name='Bob', age=5.0, height=80),Row(name='Cycy', age=10.0, height=80),Row(name='Cycy', age=10.0, height=80),Row(name='Didi', age=12.0, height=75)])
_df = rdd.toDF()
_df.show()
"""
+----+------+-----+
| age|height| name|
+----+------+-----+
| 5| 80.0|Alice|
|null| null| Bob|
|null| null| Cycy|
|null| null| Cycy|
|null| null| Didi|
+----+------+-----+
"""
网友评论