This article walks through a Spark 2.2.0 example in detail: automatically inferring the schema of JSON data, registering two temporary tables, running conditional queries, and merging the matching records from both. It is shared here as a reference; hopefully you will come away with a solid grasp of the topic.
Spark supports two ways to convert an RDD into a DataFrame:
1. Reflection: define the schema in a separate class and let Spark derive the DataFrame from it. This approach is simple, but it is not generally recommended, because Scala case classes historically supported at most 22 fields (a limit of Scala 2.10 and earlier), beyond which you had to write your own class implementing the Product interface.
2. Programmatic interface: build a StructType yourself and convert the RDD into the corresponding DataFrame. This approach takes slightly more work; the official guide lists roughly three steps:
1. Create an RDD and convert it to a JavaRDD<Row>
2. Define a StructType that matches the structure of the Rows
3. Create the DataFrame from the StructType with createDataFrame
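The full listing later in this article demonstrates approach 2 end to end. For completeness, here is a minimal sketch of approach 1 in Java, where a JavaBean plays the role of the Scala case class; note that the ReflectionDataFrame class and its Student bean are illustrative names, not part of the original example.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import java.io.Serializable;
import java.util.Arrays;

public class ReflectionDataFrame {
    // Illustrative JavaBean; its getters define the inferred schema (name: string, score: bigint)
    public static class Student implements Serializable {
        private String name;
        private long score;
        public Student() { }
        public Student(String name, long score) { this.name = name; this.score = score; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public long getScore() { return score; }
        public void setScore(long score) { this.score = score; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("ReflectionDataFrame");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        JavaRDD<Student> rdd = sc.parallelize(Arrays.asList(
                new Student("ljs1", 85), new Student("ljs2", 99)));

        // The schema is inferred from the Student bean via reflection; no StructType is needed
        Dataset<Row> df = sqlContext.createDataFrame(rdd, Student.class);
        df.printSchema();
        df.show();
        sc.close();
    }
}

Here createDataFrame(rdd, Student.class) derives the columns name and score from the bean's getters, which is exactly the step that StructType performs by hand in approach 2.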
Data preparation:
The first JSON file, student.json (Spark's JSON reader expects one object per line):
{"name":"ljs1","score":85}{"name":"ljs2","score":99}{"name":"ljs3","score":74}
The second JSON dataset is built directly in the code, as the studentJsons list inside main in the listing below.
Code example:
package com.unicom.ljs.spark220.study;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.List;
/**
* @author: Created By lujisen
* @company ChinaUnicom Software JiNan
* @date: 2020-01-28 21:08
* @version: v1.0
* @description: com.unicom.ljs.spark220.study
*/
public class JoinJsonData {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setMaster("local[*]").setAppName("JoinJsonData");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new SQLContext(sc);

        // Read the JSON file; the schema (name: string, score: bigint) is inferred automatically.
        Dataset<Row> studentDS = sqlContext.read().json("D:\\dataML\\spark1\\student.json");
        // registerTempTable is deprecated in Spark 2.x; createOrReplaceTempView is the modern equivalent.
        studentDS.registerTempTable("student_score");
        Dataset<Row> studentNameScoreDS = sqlContext.sql("select name,score from student_score where score > 82");

        // Collect the names of the students that passed the score filter.
        List<String> studentNameList = studentNameScoreDS.javaRDD().map(new Function<Row, String>() {
            @Override
            public String call(Row row) {
                return row.getString(0);
            }
        }).collect();
        System.out.println(studentNameList.toString());

        // Second dataset: JSON strings built in code and parallelized into an RDD.
        List<String> studentJsons = new ArrayList<>();
        studentJsons.add("{\"name\":\"ljs1\",\"age\":18}");
        studentJsons.add("{\"name\":\"ljs2\",\"age\":17}");
        studentJsons.add("{\"name\":\"ljs3\",\"age\":19}");
        JavaRDD<String> studentInfos = sc.parallelize(studentJsons);
        Dataset<Row> studentAgeDS = sqlContext.read().json(studentInfos);
        studentAgeDS.printSchema();
        studentAgeDS.show();
        studentAgeDS.registerTempTable("student_age");

        // Build an IN clause restricted to the names selected above.
        // Assumes at least one name passed the filter; an empty list would yield invalid SQL.
        String sql2 = "select name,age from student_age where name in (";
        for (int i = 0; i < studentNameList.size(); i++) {
            sql2 += "'" + studentNameList.get(i) + "'";
            if (i < studentNameList.size() - 1) {
                sql2 += ",";
            }
        }
        sql2 += ")";
        Dataset<Row> studentNameAgeDS = sqlContext.sql(sql2);

        // Key both query results by name and join them, producing (name, (score, age)) pairs.
        JavaPairRDD<String, Tuple2<Integer, Integer>> studentNameScoreAge = studentNameScoreDS.toJavaRDD().mapToPair(new PairFunction<Row, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(Row row) throws Exception {
                return new Tuple2<String, Integer>(row.getString(0),
                        Integer.valueOf(String.valueOf(row.getLong(1))));
            }
        }).join(studentNameAgeDS.toJavaRDD().mapToPair(new PairFunction<Row, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(Row row) throws Exception {
                return new Tuple2<String, Integer>(row.getString(0),
                        Integer.valueOf(String.valueOf(row.getLong(1))));
            }
        }));

        // Step 1 of the programmatic approach: flatten the joined pairs into Rows of (name, score, age).
        JavaRDD<Row> studentNameScoreAgeRow = studentNameScoreAge.map(new Function<Tuple2<String, Tuple2<Integer, Integer>>, Row>() {
            @Override
            public Row call(Tuple2<String, Tuple2<Integer, Integer>> v1) throws Exception {
                return RowFactory.create(v1._1, v1._2._1, v1._2._2);
            }
        });

        // Step 2: define a StructType that matches the Row structure.
        List<StructField> structFields = new ArrayList<>();
        structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        structFields.add(DataTypes.createStructField("score", DataTypes.IntegerType, true));
        structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
        StructType structType = DataTypes.createStructType(structFields);

        // Step 3: create the DataFrame from the RDD<Row> and the schema, then write it out as JSON.
        Dataset<Row> dataFrame = sqlContext.createDataFrame(studentNameScoreAgeRow, structType);
        dataFrame.printSchema();
        dataFrame.show();
        dataFrame.write().format("json").mode(SaveMode.Append).save("D:\\dataML\\spark1\\studentNameScoreAge");
    }
}
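As a design note: the pair-RDD join above is what the programmatic StructType route requires, but if you only need the merged result, the same merge can be expressed directly on the Datasets. A minimal sketch, reusing the studentNameScoreDS and studentNameAgeDS variables from the listing (score keeps its inferred bigint type here, rather than the IntegerType assigned manually above):

// Join the two query results on their shared "name" column;
// the result carries the columns name, score and age without any manual Row assembly.
Dataset<Row> joined = studentNameScoreDS.join(studentNameAgeDS, "name");
joined.printSchema();
joined.show();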
That covers how, in a Spark 2.2.0 project, to automatically infer the schema of JSON data, register two temporary tables, and merge the matching records after conditional queries. Hopefully the material above is helpful and teaches you something new. If you found the article worthwhile, feel free to share it so more people can see it.