Some useful patterns / tricks I find myself using at work.
Sometimes it is useful to append DataFrames to a list and union them back together afterwards.
```python
from functools import reduce

from pyspark.sql import DataFrame

shell = []
for i in range(M):
    df = ...
    shell.append(df)

res = reduce(DataFrame.unionAll, shell)
```
Each `df` in the loop must have the same schema. Note that `.unionAll` and `.union` are equivalent, but `.unionAll` is more explicit.
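As a minimal runnable sketch of the pattern (the toy data, column names, and loop bound here are my own invention):

```python
from functools import reduce

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a few small DataFrames that all share the same schema.
shell = []
for i in range(3):
    df = spark.createDataFrame([(i, f"row_{i}")], ["id", "label"])
    shell.append(df)

# Union them back into a single DataFrame.
res = reduce(DataFrame.unionAll, shell)
res.show()
```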
A broadcast variable is useful for looking up values; the values in a broadcast variable can be accessed via the `.value` attribute inside a UDF.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()

foo = {
    'a': 'b',
}
bc_foo = spark.sparkContext.broadcast(foo)

def _look_up(x):
    # Access the broadcast value inside the UDF.
    foo_ = bc_foo.value
    return foo_[x]

udf_look_up = F.udf(_look_up, T.StringType())
```
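Continuing the example, the UDF can be applied like any other (the `key` column and the toy DataFrame are made up for illustration):

```python
df = spark.createDataFrame([('a',), ('a',)], ['key'])
df = df.withColumn('looked_up', udf_look_up('key'))
df.show()
```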
A UDF can return multiple values if it returns a `StructType`.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

def _calculate(x):
    ...
    return x, y, z

schema = T.StructType([
    T.StructField("x", T.DoubleType()),
    T.StructField("y", T.DoubleType()),
    T.StructField("z", T.DoubleType()),
])

udf_multiple_results = F.udf(_calculate, schema)
```
This UDF takes one input `x` and returns a nested column with three fields called `x`, `y`, and `z`, which can be accessed using:
```python
df = (
    df
    .withColumn("nested_column", udf_multiple_results('x'))
    .select(
        F.col('nested_column.x').alias('x'),
        F.col('nested_column.y').alias('y'),
        F.col('nested_column.z').alias('z'),
    )
)
```
Alternatively, all fields can be flattened at once using:

```python
df = df.select('nested_column.*')
```
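Putting the pieces together as a self-contained sketch; the way `y` and `z` are computed inside `_calculate` is invented purely for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()

def _calculate(x):
    # Made-up example computation; substitute your own logic.
    y = x * 2.0
    z = x + 1.0
    return x, y, z

schema = T.StructType([
    T.StructField("x", T.DoubleType()),
    T.StructField("y", T.DoubleType()),
    T.StructField("z", T.DoubleType()),
])

udf_multiple_results = F.udf(_calculate, schema)

df = spark.createDataFrame([(1.0,), (2.0,)], ['x'])
df = df.withColumn('nested_column', udf_multiple_results('x'))
df.select('nested_column.*').show()
```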
Feel free to comment below; a GitHub account is required.