Analyzing Consumer Behavior Data with Spark: A Development Case Study

Original / Zhu Jiqian

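In day-to-day work, a common task is to use Spark to read batches of files stored in HDFS and run statistical analysis on them. These files are usually in CSV or TXT format. Take, for example, a consumer behavior dataset with the following fields: consumer name, age, gender, monthly salary, consumption preference, consumption area, shopping platform, payment method, number of items per purchase, coupon acquisition status, and shopping motivation.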

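Typical analysis goals for this consumer behavior data include the distributions of payment method preferences, shopping platform preferences, consumption preferences, shopping motivations, consumer ages, and consumer genders.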

Spark can meet these needs by reading the file and then processing and aggregating the data. A few sample rows:

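Amy Harris,39,male,18561,value for money,home goods,Tmall,WeChat Pay,10,discount,brand loyalty
Lori Willis,33,female,14071,functionality,home goods,Suning.com,cash on delivery,1,discount,daily use
Jim Williams,61,male,14145,fashion trends,auto parts,Taobao,WeChat Pay,3,free gift,gift giving
Anthony Perez,19,female,11587,fashion trends,jewelry,Pinduoduo,Alipay,5,free gift,product recommendation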
......
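To make the column positions used later easier to follow, the eleven fields can be modeled as a small case class. This is only an illustrative sketch: `ConsumerParsing`, `Consumer`, and `parseConsumer` are hypothetical names that do not appear in the original code, and the parsing assumes that no field contains a comma.

```scala
object ConsumerParsing {
  // Hypothetical record type mirroring the eleven columns of the sample file.
  final case class Consumer(
    consumerName: String, age: Int, gender: String, monthlyIncome: Int,
    consumptionPreference: String, consumptionArea: String,
    shoppingPlatform: String, paymentMethod: String,
    quantityOfItemsPurchased: Int, couponAcquisitionStatus: String,
    shoppingMotivation: String
  )

  // Parse one comma-separated line. Returns None when the line is malformed:
  // wrong number of fields, or a non-numeric age, income, or quantity.
  def parseConsumer(line: String): Option[Consumer] = {
    val f = line.split(",", -1).map(_.trim)
    if (f.length != 11) None
    else scala.util.Try(
      Consumer(f(0), f(1).toInt, f(2), f(3).toInt, f(4), f(5),
        f(6), f(7), f(8).toInt, f(9), f(10))
    ).toOption
  }
}
```

With this shape, a header row (if the file has one) also yields `None`, because its age column is not numeric, so it can be dropped together with other malformed lines.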

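import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

  def main(args: Array[String]): Unit = {
    // Run locally in a single JVM; "consumer" is the application name
    val conf = new SparkConf().setMaster("local").setAppName("consumer")
    val ss = SparkSession.builder().config(conf).getOrCreate()
    val filePath: String = "src/main/resources/consumerdata.csv"
    // Read the file line by line and split each line into an array of fields
    val consumerRDD = ss.sparkContext.textFile(filePath).map(_.split(","))

    // Print every record, with the fields separated by spaces
    consumerRDD.foreach(x => {
      x.foreach(y => print(y + " "))
      println()
    })

    // The analysis snippets in the following sections continue here, inside main
  }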
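One caveat when splitting lines with `split(",")`: on the JVM, `String.split` with no limit drops trailing empty strings, so a record whose last columns are empty produces a shorter array, and a later `x.apply(10)` would fail. The sketch below uses a made-up line (not taken from the data file) to show the difference; `SplitDemo` is a hypothetical name.

```scala
object SplitDemo {
  // A made-up record whose last two fields are empty.
  val line = "Amy Harris,39,,"

  // Default split: trailing empty strings are removed from the result.
  val dropped: Array[String] = line.split(",")

  // A negative limit keeps trailing empty strings, so column indices stay stable.
  val kept: Array[String] = line.split(",", -1)
}
```

If the real file can contain empty trailing columns, `split(",", -1)` is probably the safer choice.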

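Consumer name	Age	Gender	Monthly salary	Consumption preference	Consumption area	Shopping platform	Payment method	Items per purchase	Coupon acquisition status	Shopping motivation
Amy Harris	39	male	18561	value for money	home goods	Tmall	WeChat Pay	10	discount	brand loyalty
Lori Willis	33	female	14071	functionality	home goods	Suning.com	cash on delivery	1	discount	daily use
......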

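1. Distribution of payment method preferences

// Column 7 is the payment method: count each value and sort by count, descending
consumerRDD.map(x => (x.apply(7), 1)).reduceByKey(_ + _).sortBy(_._2, false).foreach(println)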

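2. Distribution of shopping platform preferences

// Column 6 is the shopping platform (column 5 would be the consumption area)
consumerRDD.map(x => (x.apply(6), 1)).reduceByKey(_ + _).sortBy(_._2, false).foreach(println)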

3. Distribution of consumption preferences

// Column 4 is the consumption preference
consumerRDD.map(x => (x.apply(4), 1)).reduceByKey(_ + _).sortBy(_._2, false).foreach(println)

4. Distribution of shopping motivations

// Column 10 is the shopping motivation
consumerRDD.map(x => (x.apply(10), 1)).reduceByKey(_ + _).sortBy(_._2, false).foreach(println)
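The counts above all follow the same pattern: pair the chosen field with 1, sum the ones per key, then sort by the total in descending order. To see what that pipeline computes without starting Spark, it can be mimicked on a plain Scala collection. This is a sketch only: `CountSketch` and `countByField` are hypothetical names, and `groupBy` plus a sum stands in for Spark's `reduceByKey`.

```scala
object CountSketch {
  // Count how often the field at `index` occurs, most frequent first.
  // Mirrors map(x => (x(index), 1)).reduceByKey(_ + _).sortBy(_._2, false).
  def countByField(rows: Seq[Array[String]], index: Int): Seq[(String, Int)] =
    rows
      .map(row => (row(index), 1))   // pair each row's field with 1
      .groupBy(_._1)                 // group the pairs by key
      .map { case (key, pairs) => (key, pairs.map(_._2).sum) } // sum the ones per key
      .toSeq
      .sortBy { case (_, n) => -n }  // descending by count
}
```

For example, `CountSketch.countByField(rows, 7)` on rows parsed from the file would correspond to the payment method count.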

5. Consumer age distribution

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Convert each array of fields into a Row, casting the numeric columns to Int
val rowRDD = consumerRDD.map { x =>
  Row(x.apply(0), x.apply(1).toInt, x.apply(2), x.apply(3).toInt, x.apply(4), x.apply(5),
    x.apply(6), x.apply(7), x.apply(8).toInt, x.apply(9), x.apply(10))
}

// Define the schema that maps each position to a named, typed column
val schema = StructType(Seq(
  StructField("consumerName", StringType),
  StructField("age", IntegerType),
  StructField("gender", StringType),
  StructField("monthlyIncome", IntegerType),
  StructField("consumptionPreference", StringType),
  StructField("consumptionArea", StringType),
  StructField("shoppingPlatform", StringType),
  StructField("paymentMethod", StringType),
  StructField("quantityOfItemsPurchased", IntegerType),
  StructField("couponAcquisitionStatus", StringType),
  StructField("shoppingMotivation", StringType)
))
val df = ss.createDataFrame(rowRDD, schema)
import org.apache.spark.sql.functions.{col, count, desc, when}

// Assign each consumer to an age range
val agedf = df.withColumn("age_range",
  when(col("age").between(0, 20), "0-20")
    .when(col("age").between(21, 30), "21-30")
    .when(col("age").between(31, 40), "31-40")
    .when(col("age").between(41, 50), "41-50")
    .when(col("age").between(51, 60), "51-60")
    .when(col("age").between(61, 70), "61-70")
    .when(col("age").between(71, 80), "71-80")
    .when(col("age").between(81, 90), "81-90")
    .when(col("age").between(91, 100), "91-100")
    .otherwise("Unknown")
)
// Group by age range and count the consumers in each, largest group first
val result = agedf.groupBy("age_range").agg(count("consumerName").alias("Count")).sort(desc("Count"))
result.show()
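A quick way to check that a set of age ranges has no gaps is to mirror the bucketing in a plain Scala function and test the boundaries. The `ageRange` helper below is hypothetical; it assumes the intended buckets are 0-20 followed by ten-year bands up to 100, with everything else labeled "Unknown".

```scala
object AgeBuckets {
  // Plain-Scala mirror of a when/otherwise chain over age ranges.
  def ageRange(age: Int): String =
    if (age < 0 || age > 100) "Unknown"
    else if (age <= 20) "0-20"
    else {
      // Lower bound of the ten-year band: 21, 31, ..., 91
      val lower = ((age - 1) / 10) * 10 + 1
      s"$lower-${lower + 9}"
    }
}
```

Checking values such as 70, 71, 80, and 81 makes a missing band easy to spot, since any age in a forgotten range would otherwise fall through to "Unknown".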

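6. Consumer gender distribution

// Group by gender and count the consumers in each group, largest group first
val sexResult = agedf.groupBy("gender").agg(count("consumerName").alias("Count")).sort(desc("Count"))
sexResult.show()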


