Home
A library to load data into Apache Spark™ SQL DataFrames from Apache Hive™ using LLAP. With Apache Ranger™, this library provides row/column level fine-grained access controls.
- Shared Policies: Data in a cluster can be shared securely and controlled consistently by access rules shared between Apache Spark™ and Apache Hive™.
- Audits: All security activities can be monitored and searched in a single place, i.e., Apache Ranger™.
- Resources: Each user can use different queues while accessing the secured Hive data.
For all use cases, make sure that the permission of the Hive warehouse directory is 700. This means that a non-hive user such as spark cannot access the secured tables.
$ hadoop fs -ls /apps/hive/
Found 1 items
drwx------ - hive hdfs 0 2017-02-01 20:52 /apps/hive/warehouse
In addition, make sure that hive.warehouse.subdir.inherit.perms=true is set, so that newly created tables inherit this permission by default.
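The warehouse permission check can also be scripted. The following is a minimal sketch, assuming the `hadoop fs -ls` output format from the listing earlier in this section and paths without spaces. The function names `parse_ls_line` and `is_warehouse_locked_down` are hypothetical helpers, not part of spark-llap.

```python
def parse_ls_line(line):
    """Split one `hadoop fs -ls` entry into (mode, owner, group, path)."""
    fields = line.split()
    # Assumed field order: mode, replication, owner, group, size, date, time, path
    mode, owner, group, path = fields[0], fields[2], fields[3], fields[-1]
    return mode, owner, group, path


def is_warehouse_locked_down(line, expected_owner="hive"):
    """Return True if the entry is a directory with mode 700 owned by expected_owner."""
    mode, owner, _group, _path = parse_ls_line(line)
    return mode == "drwx------" and owner == expected_owner


sample = "drwx------ - hive hdfs 0 2017-02-01 20:52 /apps/hive/warehouse"
print(is_warehouse_locked_down(sample))
```

In practice the input line would come from running `hadoop fs -ls /apps/hive/` and reading its output; the sketch only covers the parsing step.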
Run Spark Thrift Server with LLAP as user hive. Apache Ranger policies then govern Spark Thrift Server and Hive Thrift Server together, acting as a single control center.
The building section describes how to patch and how to build. For testing, refer to the test document.
A non-Hive user can also run spark-shell or pyspark as follows. The user can see only the data they are permitted to access.
$ bin/spark-shell --jars spark-llap_2.11-1.0.3-2.1.jar --conf spark.sql.hive.llap=true
scala> sql("show databases").show()
+------------+
|databaseName|
+------------+
| db_spark|
+------------+
$ bin/pyspark --jars spark-llap_2.11-1.0.3-2.1.jar --conf spark.sql.hive.llap=true
>>> sql("show databases").show()
+------------+
|databaseName|
+------------+
| db_spark|
+------------+
A non-Hive user can also submit a Spark job as follows. Note that the job fails without the spark.sql.hive.llap=true configuration. You can find the full example at examples/src/main/python/spark_llap_sql.py.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark LLAP SQL Python") \
    .master("yarn") \
    .enableHiveSupport() \
    .config("spark.sql.hive.llap", "true") \
    .getOrCreate()
spark.sql("show databases").show()
spark.sql("select * from db_spark.t_spark").show()
spark.stop()
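The source does not show the exact submit command, so the following is only a sketch of how one might assemble it. It assumes that `spark-submit` takes the same `--jars` and `--conf` options used with spark-shell above; `build_submit_command` is a hypothetical helper.

```python
import shlex


def build_submit_command(script, jar, master="yarn", llap_enabled=True):
    """Return the spark-submit argument list for a Spark-LLAP job."""
    return [
        "bin/spark-submit",
        "--master", master,
        "--jars", jar,
        # The job fails unless this option is set to true.
        "--conf", "spark.sql.hive.llap={}".format(str(llap_enabled).lower()),
        script,
    ]


cmd = build_submit_command(
    "examples/src/main/python/spark_llap_sql.py",
    "spark-llap_2.11-1.0.3-2.1.jar",
)
# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed directly to `subprocess.run` if the command should be launched from Python.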
You need the HiveServer2 Interactive service. For example, in HDP 2.5 or HDP 2.6 Preview, navigate to Hive -> Configs -> Settings -> Interactive Query and turn on Enable Interactive Query (Tech Preview).
If you want to use access control, you need to set up Apache Ranger policies. The following are some example policies.
Name | Table | Column | Permissions |
---|---|---|---|
spark_access | t_spark | * | Select |

Name | Table | Column | Access Types | Select Masking Option |
---|---|---|---|---|
spark_mask | t_spark | name | Select | partial mask:'show first 4' |

Name | Table | Access Types | Row Level Filter |
---|---|---|---|
spark_filter | t_spark | Select | gender='M' |
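To make the combined effect of the masking and row-filter policies concrete, here is a small pure-Python illustration. It does not call Ranger. It assumes the 'show first 4' mask follows the convention of Hive's `mask_show_first_n` function (uppercase letters become `X`, lowercase letters `x`, digits `n`), and the sample rows are hypothetical.

```python
def mask_show_first_n(value, n=4):
    """Keep the first n characters; mask the rest (letters -> x/X, digits -> n)."""
    def mask_char(ch):
        if ch.isupper():
            return "X"
        if ch.islower():
            return "x"
        if ch.isdigit():
            return "n"
        return ch
    return value[:n] + "".join(mask_char(ch) for ch in value[n:])


def apply_policies(rows):
    """Apply the spark_filter row filter, then the spark_mask column mask."""
    visible = [row for row in rows if row["gender"] == "M"]
    return [dict(row, name=mask_show_first_n(row["name"])) for row in visible]


# Hypothetical contents of db_spark.t_spark
rows = [
    {"name": "Alexander", "gender": "M"},
    {"name": "Beatrice", "gender": "F"},
    {"name": "Bob", "gender": "M"},
]
for row in apply_policies(rows):
    print(row)
```

Only the rows with gender='M' remain, and names longer than four characters have their tail masked. What a real cluster returns depends on the Ranger and Hive versions in use.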
The following policy provides a temporary space for Spark while executing INSERT INTO.
Name | Database | Table | Column | Permissions |
---|---|---|---|---|
spark_system | default | tmp_* | * | All |
You can download the pre-built library at https://github.com/hortonworks-spark/spark-llap/releases .
To build spark-llap from source, do the following.
git clone https://github.com/hortonworks-spark/spark-llap.git -b branch-2.1
cd spark-llap
build/sbt assembly
To build Apache Spark with the Spark-LLAP Ranger integration patch applied, do the following.
git clone https://github.com/apache/spark.git -b branch-2.1
cd spark
curl https://github.com/raw/hortonworks-spark/spark-llap/branch-2.1/patch/0001-SPARK-LLAP-RANGER-Integration.patch | git am
build/sbt -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver package
Copy your existing hive-site.xml and spark-defaults.conf into the Apache Spark conf folder. Add the following three configurations to spark-defaults.conf.
spark.sql.hive.hiveserver2.url jdbc:hive2://YourHiveServer2URL:10500
spark.hadoop.hive.llap.daemon.service.hosts *value for hive.llap.daemon.service.hosts in Hive configuration*
spark.hadoop.hive.zookeeper.quorum *value for hive.zookeeper.quorum in Hive configuration*
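Copying the two Hive values by hand is error-prone, so they can be read from the Hive configuration XML instead. This is a sketch only: it assumes both properties are present in the XML you pass in (on some clusters they may live in a separate interactive-site file), and the XML content and host names below are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical Hive configuration content; real values come from your cluster.
HIVE_SITE = """
<configuration>
  <property>
    <name>hive.llap.daemon.service.hosts</name>
    <value>@llap0</value>
  </property>
  <property>
    <name>hive.zookeeper.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a {name: value} dict from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text.strip())
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


def spark_defaults_lines(hive_props, hiveserver2_url):
    """Build the three spark-defaults.conf entries listed above."""
    keys = ["hive.llap.daemon.service.hosts", "hive.zookeeper.quorum"]
    lines = ["spark.sql.hive.hiveserver2.url {}".format(hiveserver2_url)]
    lines += ["spark.hadoop.{} {}".format(k, hive_props[k]) for k in keys]
    return lines


props = read_properties(HIVE_SITE)
for line in spark_defaults_lines(props, "jdbc:hive2://YourHiveServer2URL:10500"):
    print(line)
```

A missing property raises `KeyError`, which is deliberate here: a silently empty value would be harder to diagnose than an early failure.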
Start Spark Thrift Server with spark.sql.hive.llap=true.
sbin/start-thriftserver.sh --jars spark-llap_2.11-1.0.3-2.1.jar --conf spark.sql.hive.llap=true
You can turn off spark-llap by restarting Spark Thrift Server without this option or with spark.sql.hive.llap=false.
It is recommended to run Spark Thrift Server as user hive to use more SQL features.
You can access the spark-llap-enabled Spark Thrift Server via beeline or Apache Zeppelin™.
beeline -u jdbc:hive2://localhost:10016 -n hive -p password -e 'show databases'
beeline -u jdbc:hive2://localhost:10016 -n spark -p password -e 'show tables'
There are two simple Python examples to submit jobs using Spark-LLAP.
examples/src/main/python/spark_llap_sql.py
examples/src/main/python/spark_llap_dsl.py
On kerberized clusters, add the following to Custom hive-interactive-site.
hive.llap.task.principal=hive/_HOST@EXAMPLE.COM
hive.llap.task.keytab.file=/etc/security/keytabs/hive.service.keytab
On a kerberized cluster, you can access the spark-llap-enabled Spark Thrift Server via beeline or Apache Zeppelin™.
beeline -u "jdbc:hive2://hostname:10500/;principal=hive/_HOST@EXAMPLE.COM;hive.server2.proxy.user=hive" -p password -e 'show tables'
beeline -u "jdbc:hive2://hostname:10500/;principal=hive/_HOST@EXAMPLE.COM;hive.server2.proxy.user=spark" -p password -e 'show tables'
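The two kerberized connection strings differ only in the proxy user, so they can be generated from one template. This is a minimal sketch that mirrors the URL shape of the beeline commands above; `kerberized_jdbc_url` is a hypothetical helper, and the host, port, and principal are the placeholder values from those commands.

```python
def kerberized_jdbc_url(host, port, principal, proxy_user=None):
    """Build a HiveServer2-style JDBC URL with a Kerberos principal."""
    params = ["principal={}".format(principal)]
    if proxy_user:
        # Optional: the user on whose behalf the connection is made.
        params.append("hive.server2.proxy.user={}".format(proxy_user))
    return "jdbc:hive2://{}:{}/;{}".format(host, port, ";".join(params))


for user in ("hive", "spark"):
    print(kerberized_jdbc_url("hostname", 10500, "hive/_HOST@EXAMPLE.COM", user))
```

Quote the resulting URL when passing it to `beeline -u`, because the semicolons would otherwise be interpreted by the shell.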