+---+---------+
| id|    types|
+---+---------+
|  1|[a, b, c]|
|  2|   [d, d]|
+---+---------+
Let's take a column that holds arrays like this and expand it into separate rows.
from pyspark.sql.functions import col, explode
df = spark.createDataFrame([
(1, ["a","b","c"]),
(2, ["d", "d"])
],
["id", "types"])
# explode emits one output row per array element
df = df.withColumn("type", explode(col("types")))
df.show()
The explode function expands the array into one row per element. If the types column is no longer needed, remove it with .drop("types").
+---+---------+----+
| id|    types|type|
+---+---------+----+
|  1|[a, b, c]|   a|
|  1|[a, b, c]|   b|
|  1|[a, b, c]|   c|
|  2|   [d, d]|   d|
|  2|   [d, d]|   d|
+---+---------+----+
To go from this exploded state back to an array:
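from pyspark.sql.functions import col, collect_list, explode
df = spark.createDataFrame([
(1, ["a","b","c"]),
(2, ["d", "d"])
],
["id", "types"])
df = df.withColumn("type", explode(col("types")))
# group the exploded rows by id and gather the values back into an array
df = df.groupby("id").agg(collect_list("type").alias("types"))
df.show()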
Using groupby together with collect_list restores the original arrays (the element order may differ from the original [unverified]).
+---+---------+
| id|    types|
+---+---------+
|  1|[a, b, c]|
|  2|   [d, d]|
+---+---------+