Spark on Hive Configuration and Testing, and Distributed SQL ThriftServer Configuration

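About this article: The distributed SQL execution engine is the ThriftServer service provided by Spark. It runs continuously as a background process and exposes a port to clients. When a SQL statement is submitted, what runs underneath is a Spark job. In effect, this builds a database service that uses the MetaStore service for metadata and Spark as the execution engine, so you can run distributed SQL computations with SparkSQL as conveniently as you would operate a database.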

<configuration>
    <!-- Tell Spark where tables are created (the warehouse directory) -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>

    <!-- Tell Spark where the Hive metastore is located -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://node1:9083</value>
    </property>
</configuration>
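
A typo in a property name here tends to fail silently, so it can be worth checking the file before starting Spark. The following is a minimal sketch, assuming the configuration above is a Hadoop-style XML file with `<property>` entries. The helper name `read_hive_site` and the inline `SAMPLE` string are illustrative, not part of Spark or Hive.

```python
import xml.etree.ElementTree as ET


def read_hive_site(xml_text):
    """Return the <property> entries of a Hadoop-style config as a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Inline copy of the configuration shown above (hypothetical stand-in for the file)
SAMPLE = """
<configuration>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://node1:9083</value>
    </property>
</configuration>
"""

if __name__ == "__main__":
    conf = read_hive_site(SAMPLE)
    # Report the two settings Spark needs in order to find the metastore
    for key in ("hive.metastore.warehouse.dir", "hive.metastore.uris"):
        print(key, "=", conf.get(key, "<missing>"))
```

To check a real file, read its contents and pass the text to `read_hive_site` instead of `SAMPLE`.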

<!-- Metastore address for a metastore deployed in remote mode -->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://node1:9083</value>
</property>

nohup bin/hive --service metastore >> /export/server/apache-hive-3.1.2/log/metastore.log 2>&1 &
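
Because the metastore is started in the background, a failed start is easy to miss. One way to confirm that it is accepting connections is a plain TCP check against the Thrift port. This is a rough sketch: the function name `is_port_open` is hypothetical, and an open port only shows that something is listening, not that the metastore is healthy.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names
        return False
```

With the addresses used in this article, the call would be `is_port_open("node1", 9083)`.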

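# Make sure the metastore service is running before executing this script
from pyspark.sql import SparkSession

if __name__ == '__main__':
    # Parentheses let the builder chain span several lines
    ss = (
        SparkSession.builder
        .appName("test")
        .master("local[*]")
        .config("spark.sql.shuffle.partitions", 2)
        .config("spark.sql.warehouse.dir", "hdfs://node1:8020/user/hive/warehouse/")
        .config("hive.metastore.uris", "thrift://node1:9083")
        .enableHiveSupport()
        .getOrCreate()
    )
    sc = ss.sparkContext

    # Query a Hive table whose metadata comes from the metastore
    ss.sql('''select * from student''').show()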

./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.server2.thrift.bind.host=node1 --master local[2]
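
If the ThriftServer is launched from a script rather than typed by hand, assembling the arguments as a list avoids shell quoting issues, for example `local[2]` being treated as a glob pattern in some shells. The sketch below only builds and prints the command. The function `build_thriftserver_cmd` and its defaults are illustrative, taken from the command above.

```python
import shlex


def build_thriftserver_cmd(port=10000, bind_host="node1", master="local[2]",
                           script="./start-thriftserver.sh"):
    """Assemble the argv list for launching the Spark ThriftServer."""
    return [
        script,
        "--hiveconf", f"hive.server2.thrift.port={port}",
        "--hiveconf", f"hive.server2.thrift.bind.host={bind_host}",
        "--master", master,
    ]


if __name__ == "__main__":
    cmd = build_thriftserver_cmd()
    # shlex.join quotes arguments that contain shell-special characters
    print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd)` from Spark's `sbin` directory, which bypasses the shell entirely.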

yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel libffi-devel gcc make gcc-c++ python-devel cyrus-sasl-devel cyrus-sasl-plain cyrus-sasl-gssapi -y
pip install pyhive pymysql sasl thrift thrift_sasl

from pyhive import hive

if __name__ == '__main__':
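    # Open a Hive connection (served by the Spark ThriftServer)
    conn = hive.connect(host="node1", port=10000, username="root")
    # Get a cursor object
    cursor = conn.cursor()
    # Execute the SQL statement
    cursor.execute("SELECT * FROM student")
    # Fetch all result rows with fetchall()
    res = cursor.fetchall()
    print(res)
    # Release the cursor and the connection
    cursor.close()
    conn.close()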
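
`fetchall()` returns plain tuples, so column names are lost. Since PyHive follows the Python DB-API, the cursor's `description` attribute can be used to pair values with column names. The sketch below uses the standard library's `sqlite3` as a stand-in so that it runs without a cluster. The helper `rows_as_dicts` and the `id`/`name` columns are hypothetical, and stripping a `table.` prefix is an assumption about how some HiveServer2-compatible servers label columns.

```python
import sqlite3


def rows_as_dicts(cursor):
    """Pair each fetched row with column names from cursor.description."""
    # DB-API 2.0: each description entry is a 7-item sequence, item 0 is the name.
    # Drop any "table." prefix so keys are bare column names (assumption).
    columns = [col[0].split(".")[-1] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


if __name__ == "__main__":
    # sqlite3 stands in for the ThriftServer connection; sample data is made up
    conn = sqlite3.connect(":memory:")
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE student (id INTEGER, name TEXT)")
    cursor.execute("INSERT INTO student VALUES (1, 'alice'), (2, 'bob')")
    cursor.execute("SELECT * FROM student")
    for record in rows_as_dicts(cursor):
        print(record)
    conn.close()
```

With a PyHive cursor, the same helper would be called right after `cursor.execute(...)` in place of `cursor.fetchall()`.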