Using date_format made it easy to convert date data into another format and extract the parts I needed, so here is a quick memo.
from pyspark.sql import functions as F

# Assumes an active SparkSession named `spark` (for example, in a notebook or the pyspark shell).
df = spark.createDataFrame(
    [
        (1, "2020-01-09 00:00:00"),
        (2, "2020-01-09 20:00:00"),
    ],
    ["id", "date"],
)

# Extract each part of the date as a formatted string.
df = df.withColumn("year", F.date_format("date", "yyyy"))
df = df.withColumn("month", F.date_format("date", "MM"))
df = df.withColumn("day", F.date_format("date", "dd"))
df.show()
Output
+---+-------------------+----+-----+---+
| id|               date|year|month|day|
+---+-------------------+----+-----+---+
|  1|2020-01-09 00:00:00|2020|   01| 09|
|  2|2020-01-09 20:00:00|2020|   01| 09|
+---+-------------------+----+-----+---+