RDD转DataFrame的两种方法

使用反射来推断包含特定类型对象的 RDD 的模式(Inferring the Schema Using Reflection)

The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

这种基于反射的方法会导致更简洁的代码, 当你在编写 Spark 应用程序的时候已经知道了这个模式。也就是说在运行前你就知道了这个RDD各各字段的类型

这是官网的例子

1
2
3
4
5
6
7
8
9
// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
.textFile("examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()

这是自己的实现的例子

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
package com.anthony.spark

import org.apache.spark.sql.SparkSession

/**
* @ Description:
* @ Date: Created in 20:44 2018/3/29
* @ Author: Anthony_Duan
*/
object DataFrameCase {

def main(args: Array[String]): Unit = {

val spark = SparkSession.builder().appName("DataFrameCase").master("local[2]").getOrCreate()

val rdd = spark.sparkContext.textFile("file:///Users/duanjiaxing/data/student.data")

//注意:需要导入隐式转换
import spark.implicits._
val studentDF = rdd.map(_.split("\\|")).map(line => Student(line(0).toInt, line(1), line(2), line(3))).toDF()

studentDF.show()

spark.stop()

}
case class Student(id: Int, name: String, phone: String, email: String)

}

使用编程接口构建一个模式然后应用到RDD的模式(Programmatically Specifying the Schema)

When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.
it allows you to construct Datasets when the columns and their types are not known until runtime.

也就是说很多字段的类型只有你运行的时候才能知道

这个方法分为3步

  1. Create an RDD of Rows from the original RDD;
    创建RDD
  2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
    创建与由步骤1中创建的RDD中的行结构匹配的StructType所表示的模式。
  3. Apply the schema to the RDD of Rows via createDataFrame method provided by SparkSession.
    通过SparkSession提供的createDataFrame方法将StructType应用于行的RDD。

官网的例子

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
.map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
.map(_.split(","))
.map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.show()

自己的例子,这个例子包含了两种转换方式

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
package com.anthony.spark

import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
* DataFrame和RDD的互操作
*/
object DataFrameRDDApp {

def main(args: Array[String]) {

val spark = SparkSession.builder().appName("DataFrameRDDApp").master("local[2]").getOrCreate()

inferReflection(spark)

program(spark)

spark.stop()
}
/**
* 使用编程的方式RDD转DF
* @param spark
*/
def program(spark: SparkSession): Unit = {
// RDD ==> DataFrame
val rdd = spark.sparkContext.textFile("file:///Users/duanjiaxing/data/infos.txt")

val infoRDD = rdd.map(_.split(",")).map(line => Row(line(0).toInt, line(1), line(2).toInt))

val structType = StructType(Array(StructField("id", IntegerType, nullable = true),
StructField("name", StringType, nullable = true),
StructField("age", IntegerType, nullable = true)))

val infoDF = spark.createDataFrame(infoRDD,structType)
infoDF.printSchema()
infoDF.show()


//通过df的api进行操作
infoDF.filter(infoDF.col("age") > 30).show

//通过sql的方式进行操作
infoDF.createOrReplaceTempView("infos")
spark.sql("select * from infos where age > 30").show()
}


/**
* 使用反射的方式RDD转DF
* @param spark
*/
def inferReflection(spark: SparkSession) {
// RDD ==> DataFrame
val rdd = spark.sparkContext.textFile("file:///Users/duanjiaixng/data/infos.txt")

//注意:需要导入隐式转换
import spark.implicits._
val infoDF = rdd.map(_.split(",")).map(line => Info(line(0).toInt, line(1), line(2).toInt)).toDF()

infoDF.show()

infoDF.filter(infoDF.col("age") > 30).show

infoDF.createOrReplaceTempView("infos")
spark.sql("select * from infos where age > 30").show()
}

case class Info(id: Int, name: String, age: Int)

}

-------------End Of This ArticleThank You For Reading-------------