Advanced spark-sql examples

Published: 2020-05-25 06:38:17 | Source: web | Views: 808 | Author: 原生zzy | Category: Big Data

(1) Hardcore example: word count with a UDTF

Data format: each line is a string of words separated by spaces.
Code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SQLContext, SparkSession}

object SparkSqlTest {
    def main(args: Array[String]): Unit = {
        // Suppress noisy log output
        Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
        Logger.getLogger("org.project-spark").setLevel(Level.WARN)
        // Build the entry point
        val conf: SparkConf = new SparkConf()
        conf.setAppName("SparkSqlTest")
            .setMaster("local[2]")

        // enableHiveSupport() needs the spark-hive module on the classpath
        val spark: SparkSession = SparkSession.builder().config(conf)
            .enableHiveSupport()
            .getOrCreate()

        // Obtain the SQLContext
        val sqlContext: SQLContext = spark.sqlContext
        // Read each line of the file into a single column named "line"
        val wordDF: DataFrame = sqlContext.read.text("C:\\z_data\\test_data\\ip.txt").toDF("line")
        wordDF.createTempView("lines")
        // split() cuts each line on whitespace, explode() emits one row per word
        val sql =
            """
              |select t1.word, count(1) counts
              |from (
              |select explode(split(line, '\\s+')) word
              |from lines) t1
              |group by t1.word
              |order by counts
            """.stripMargin
        spark.sql(sql).show()
        spark.stop()
    }
}

Result: show() prints each word with its count, in ascending order of count (the first 20 rows by default).
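To make clear what the explode(split(...)) plus group by query computes, here is a minimal sketch of the same logic on an in-memory collection, using only the Scala standard library. The object name WordCountSketch and its two functions are hypothetical names introduced for illustration, and unlike the SQL query the sketch drops empty tokens.

```scala
object WordCountSketch {
  // Split each line on runs of whitespace (the role of split(line, '\\s+')),
  // flatten the pieces into one element per word (the role of explode),
  // then count the occurrences of each word (group by word + count(1)).
  // Assumption: empty tokens are dropped, which the SQL query does not do.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.trim.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // Ascending by count, mirroring "order by counts";
  // ties are broken by the word itself so the order is deterministic.
  def sortedCounts(lines: Seq[String]): Seq[(String, Int)] =
    wordCount(lines).toSeq.sortBy { case (word, count) => (count, word) }
}
```

Calling WordCountSketch.sortedCounts on a handful of lines gives the same kind of (word, counts) pairs that the query returns, which can be handy for checking the SQL on a small sample.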

(2) Top N with a window function

Data format: JSON records with the fields course, name, and score.
Goal: take the three best scores in each course.
Code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SQLContext, SparkSession}

object SparkSqlTest {
    def main(args: Array[String]): Unit = {
        // Suppress noisy log output
        Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
        Logger.getLogger("org.project-spark").setLevel(Level.WARN)
        // Build the entry point
        val conf: SparkConf = new SparkConf()
        conf.setAppName("SparkSqlTest")
            .setMaster("local[2]")

        // enableHiveSupport() needs the spark-hive module on the classpath
        val spark: SparkSession = SparkSession.builder().config(conf)
            .enableHiveSupport()
            .getOrCreate()

        // Obtain the SQLContext
        val sqlContext: SQLContext = spark.sqlContext
        val topnDF: DataFrame = sqlContext.read.json("C:\\z_data\\test_data\\score.json")
        topnDF.createTempView("student")
        // row_number() numbers the rows of each course from the highest score down;
        // the outer query keeps the rows numbered 1 to 3
        val sql =
            """select
              |t1.course course,
              |t1.name name,
              |t1.score score
              |from (
              |select
              |course,
              |name,
              |score,
              |row_number() over(partition by course order by score desc) top
              |from student) t1 where t1.top <= 3
            """.stripMargin
        spark.sql(sql).show()
        spark.stop()
    }
}

Result: show() prints at most three rows per course, the ones with the highest scores.
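The window query above boils down to "partition by course, order by score descending, keep the first three rows". A minimal in-memory sketch of that logic follows, using only the Scala standard library. The case class Score, the object name TopNSketch, and the Int type for score are assumptions made for illustration; the real column types depend on what read.json infers from the file.

```scala
object TopNSketch {
  // Hypothetical record type mirroring the course, name, and score fields.
  final case class Score(course: String, name: String, score: Int)

  // groupBy plays the role of "partition by course",
  // sortBy(-score) the role of "order by score desc",
  // and take(n) the role of "where top <= n".
  def topNPerCourse(rows: Seq[Score], n: Int): Map[String, Seq[Score]] =
    rows.groupBy(_.course).map { case (course, group) =>
      course -> group.sortBy(r => -r.score).take(n)
    }
}
```

One difference worth noting: sortBy is stable, so tied scores keep their input order here, whereas row_number() does not promise any particular order among ties. If tied students should all be kept, rank() or dense_rank() is the usual choice in SQL.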

(3) Handling data skew with SparkSQL

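Approach (two-stage aggregation):
- Find the key that causes the data skew
- Split the skewed key by attaching a random prefix to it
- Aggregate locally on the prefixed keys
- Remove the prefix
- Aggregate globally
Taking the word count above as the example, the first step is to find the word that accounts for a disproportionate share of the data.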
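The steps above can be sketched on an in-memory collection before looking at the Spark version. This is a minimal illustration using only the Scala standard library, not a distributed implementation; the object name TwoStageAggregationSketch and the parameters buckets and seed are hypothetical. The point it demonstrates is that salting the hot key and aggregating twice gives the same totals as a direct count.

```scala
import scala.util.Random

object TwoStageAggregationSketch {
  // Step 2: split the hot key by prepending a random salt in [0, buckets).
  def addPrefix(word: String, buckets: Int, rng: Random): String =
    s"${rng.nextInt(buckets)}_$word"

  // Step 4: strip everything up to and including the first underscore.
  def removePrefix(salted: String): String =
    salted.substring(salted.indexOf("_") + 1)

  // Plain count per key, used for both the local and the normal aggregation.
  def count(keys: Seq[String]): Map[String, Int] =
    keys.groupBy(identity).map { case (key, occurrences) => key -> occurrences.size }

  def twoStageCount(words: Seq[String], hotWord: String,
                    buckets: Int, seed: Long): Map[String, Int] = {
    val rng = new Random(seed)
    // Step 1 is assumed done: hotWord is the key identified as skewed.
    val (hot, normal) = words.partition(_ == hotWord)
    // Step 3: local aggregation on the salted keys.
    val partial: Map[String, Int] = count(hot.map(addPrefix(_, buckets, rng)))
    // Steps 4 and 5: remove the salt, then sum the partial counts.
    val hotTotal: Map[String, Int] = partial.toSeq
      .map { case (salted, c) => removePrefix(salted) -> c }
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
    // Keys without skew are counted in a single pass.
    hotTotal ++ count(normal)
  }
}
```

In a cluster the benefit comes from step 3: the salted keys hash to different partitions, so the rows of the hot key are spread over up to buckets tasks instead of landing on one.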
Code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SQLContext, SparkSession}
import scala.util.Random

object SparkSqlTest {
    def main(args: Array[String]): Unit = {
        // Suppress noisy log output
        Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
        Logger.getLogger("org.project-spark").setLevel(Level.WARN)
        // Build the entry point
        val conf: SparkConf = new SparkConf()
        conf.setAppName("SparkSqlTest")
            .setMaster("local[2]")

        // enableHiveSupport() needs the spark-hive module on the classpath
        val spark: SparkSession = SparkSession.builder().config(conf)
            .enableHiveSupport()
            .getOrCreate()
        // Obtain the SQLContext
        val sqlContext: SQLContext = spark.sqlContext
        // Register the UDFs
        sqlContext.udf.register[String, String, Int]("add_prefix", add_prefix _)
        sqlContext.udf.register[String, String]("remove_prefix", remove_prefix _)
        // Obtain the SparkContext
        val sc: SparkContext = spark.sparkContext
        val lineRDD: RDD[String] = sc.textFile("C:\\z_data\\test_data\\ip.txt")
        // Find the skewed word: sample 20% of the words without replacement
        // and take the most frequent one in the sample
        val wordsRDD: RDD[String] = lineRDD.flatMap(line => {
            line.split("\\s+")
        })
        val sampleRDD: RDD[String] = wordsRDD.sample(false, 0.2)
        val sortRDD: RDD[(String, Int)] = sampleRDD.map(word => (word, 1)).reduceByKey(_ + _).sortBy(kv => kv._2, false)
        val hot_word = sortRDD.take(1)(0)._1
        val bs: Broadcast[String] = sc.broadcast(hot_word)

        import spark.implicits._
        // Separate the skewed key from the rest
        val lineDF: DataFrame = sqlContext.read.text("C:\\z_data\\test_data\\ip.txt")
        val wordDF: Dataset[String] = lineDF.flatMap(row => {
            row.getAs[String](0).split("\\s+")
        })
        // Rows holding the skewed word
        val hotDS: Dataset[String] = wordDF.filter(word => {
            word == bs.value
        })
        val hotDF: DataFrame = hotDS.toDF("word")
        hotDF.createTempView("hot_table")
        // Rows holding every other word
        val norDS: Dataset[String] = wordDF.filter(word => {
            word != bs.value
        })
        val norDF: DataFrame = norDS.toDF("word")
        norDF.createTempView("nor_table")
        // t1 adds the prefix, t2 aggregates locally, t3 removes the prefix,
        // and the outer query aggregates globally. The two branches never
        // share a word, so union all suffices and avoids a deduplication pass.
        val sql =
            """
              |(select
              |t3.word,
              |sum(t3.counts) counts
              |from (select
              |remove_prefix(t2.newword) word,
              |t2.counts
              |from (select
              |t1.newword newword,
              |count(1) counts
              |from
              |(select
              |add_prefix(word, 3) newword
              |from hot_table) t1
              |group by t1.newword) t2) t3
              |group by t3.word)
              |union all
              |(select
              | word,
              | count(1) counts
              |from nor_table
              |group by word)
            """.stripMargin
        spark.sql(sql).show()
        spark.stop()
    }
    // UDF that prepends a random number in [0, range) and an underscore
    def add_prefix(word: String, range: Int): String = {
        s"${Random.nextInt(range)}_$word"
    }
    // UDF that removes the prefix, up to and including the first underscore
    def remove_prefix(word: String): String = {
        word.substring(word.indexOf("_") + 1)
    }
}

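Result: show() prints one row per word with its total count; the total for the skewed word is the sum of its partial counts across the random prefixes.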
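The hot-key detection in the code above (sample(false, 0.2), then count and sort) can also be sketched without Spark. The version below is a rough in-memory approximation under stated assumptions: it keeps each element independently with probability fraction, which is how sampling without replacement behaves on an RDD, and it returns an Option instead of failing when the sample is empty. The names HotKeySketch, hotKey, fraction, and seed are hypothetical.

```scala
import scala.util.Random

object HotKeySketch {
  // Returns the most frequent word in a random sample, or None if the
  // sample turned out empty.
  def hotKey(words: Seq[String], fraction: Double, seed: Long): Option[String] = {
    val rng = new Random(seed)
    // Keep each element independently with probability `fraction`.
    val sample = words.filter(_ => rng.nextDouble() < fraction)
    if (sample.isEmpty) None
    else Some(sample.groupBy(identity).maxBy { case (_, occurrences) => occurrences.size }._1)
  }
}
```

Because the result comes from a sample, it is only an estimate: it is reliable when one key clearly dominates, and less so when several keys have similar frequencies.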
