Splitting date information into separate columns in PySpark

+---+-------------------+
| id|               date|
+---+-------------------+
|  1|2020-11-28 20:01:02|
|  2|2020-11-29 21:03:04|
|  3|2020-11-30 22:05:06|
+---+-------------------+

Let's split the values in this date column into separate columns, one per component.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month, dayofmonth, hour, minute, second, dayofweek

# Reuse the active session if one exists (for example in a notebook), otherwise create one.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
        (1, "2020-11-28 20:01:02"),
        (2, "2020-11-29 21:03:04"),
        (3, "2020-11-30 22:05:06")
    ],
    ["id", "date"])

# date is a string column; each function below casts it implicitly before extracting its field.
df = df.withColumn("year", year(col("date")))
df = df.withColumn("month", month(col("date")))
df = df.withColumn("day", dayofmonth(col("date")))
df = df.withColumn("hour", hour(col("date")))
df = df.withColumn("minute", minute(col("date")))
df = df.withColumn("second", second(col("date")))
df = df.withColumn("dayofweek", dayofweek(col("date")))

df.show()

dayofweek holds the day of the week as a number, counting from Sunday as 1 through Saturday as 7.

+---+-------------------+----+-----+---+----+------+------+---------+
| id|               date|year|month|day|hour|minute|second|dayofweek|
+---+-------------------+----+-----+---+----+------+------+---------+
|  1|2020-11-28 20:01:02|2020|   11| 28|  20|     1|     2|        7|
|  2|2020-11-29 21:03:04|2020|   11| 29|  21|     3|     4|        1|
|  3|2020-11-30 22:05:06|2020|   11| 30|  22|     5|     6|        2|
+---+-------------------+----+-----+---+----+------+------+---------+
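
The dayofweek values above can be cross-checked without Spark. The following is a minimal sketch in plain Python, assuming the same Sunday = 1 through Saturday = 7 convention; the helper names spark_style_dayofweek and day_name are hypothetical and are not part of PySpark.

```python
from datetime import datetime

# Weekday names indexed by (dayofweek - 1), following the Sunday=1 convention.
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]


def spark_style_dayofweek(ts: str) -> int:
    """Return the weekday number of a 'yyyy-MM-dd HH:mm:ss' string, Sunday=1 ... Saturday=7."""
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    # isoweekday() is Monday=1 ... Sunday=7; the modulo maps Sunday to 0 before adding 1.
    return parsed.isoweekday() % 7 + 1


def day_name(number: int) -> str:
    """Translate a Sunday=1 weekday number into its English name."""
    return DAY_NAMES[number - 1]


for ts in ["2020-11-28 20:01:02", "2020-11-29 21:03:04", "2020-11-30 22:05:06"]:
    n = spark_style_dayofweek(ts)
    print(ts, n, day_name(n))
```

For the three sample rows this gives 7, 1, and 2 (Saturday, Sunday, Monday), matching the dayofweek column in the table.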
