Skip to main content

Apache Spark setup

profiles.yml file is for dbt Core users only

If you're using dbt Cloud, you don't need to create a profiles.yml file. This file is only for dbt Core users. To connect your data platform to dbt Cloud, refer to About data platforms.

If you're using Databricks, use dbt-databricks

If you're using Databricks, the dbt-databricks adapter is recommended over dbt-spark. If you're still using dbt-spark with Databricks consider migrating from the dbt-spark adapter to the dbt-databricks adapter.

For the Databricks version of this page, refer to Databricks setup.

  • Maintained by: dbt Labs
  • Authors: core dbt maintainers
  • GitHub repo: dbt-labs/dbt-spark
  • PyPI package: dbt-spark
  • Slack channel: db-databricks-and-spark
  • Supported dbt Core version: v0.15.0 and newer
  • dbt Cloud support: Supported
  • Minimum data platform version: n/a

Installing dbt-spark

Use pip to install the adapter. Before 1.8, installing the adapter would automatically install dbt-core and any additional dependencies. Beginning in 1.8, installing an adapter does not automatically install dbt-core. This is because adapters and dbt Core versions have been decoupled from each other so we no longer want to overwrite existing dbt-core installations. Use the following command for installation:

python -m pip install dbt-core dbt-spark

Configuring dbt-spark

For Spark-specific configuration, please refer to Spark configs.

If connecting to Databricks via ODBC driver, it requires pyodbc. Depending on your system, you can install it seperately or via pip. See the pyodbc wiki for OS-specific installation details.

If connecting to a Spark cluster via the generic thrift or http methods, it requires PyHive.

# odbc connections
$ python -m pip install "dbt-spark[ODBC]"

# thrift or http connections
$ python -m pip install "dbt-spark[PyHive]"
# session connections
$ python -m pip install "dbt-spark[session]"

Configuring dbt-spark

For Spark-specific configuration please refer to Spark Configuration

For further info, refer to the GitHub repository: dbt-labs/dbt-spark

Connection methods

dbt-spark can connect to Spark clusters by four different methods:

  • odbc is the preferred method when connecting to Databricks. It supports connecting to a SQL Endpoint or an all-purpose interactive cluster.

  • thrift connects directly to the lead node of a cluster, either locally hosted / on premise or in the cloud (e.g. Amazon EMR).

  • http is a more generic method for connecting to a managed service that provides an HTTP endpoint. Currently, this includes connections to a Databricks interactive cluster.

  • session connects to a pySpark session, running locally or on a remote machine.

Advanced functionality

The session connection method is intended for advanced users and experimental dbt development. This connection method is not supported by dbt Cloud.

ODBC

Use the odbc connection method if you are connecting to a Databricks SQL endpoint or interactive cluster via ODBC driver. (Download the latest version of the official driver here.)

~/.dbt/profiles.yml
your_profile_name:
target: dev
outputs:
dev:
type: spark
method: odbc
driver: [path/to/driver]
schema: [database/schema name]
host: [yourorg.sparkhost.com]
organization: [org id] # Azure Databricks only
token: [abc123]

# one of:
endpoint: [endpoint id]
cluster: [cluster id]

# optional
port: [port] # default 443
user: [user]
server_side_parameters:
"spark.driver.memory": "4g"

Thrift

Use the thrift connection method if you are connecting to a Thrift server sitting in front of a Spark cluster, e.g. a cluster running locally or on Amazon EMR.

~/.dbt/profiles.yml
your_profile_name:
target: dev
outputs:
dev:
type: spark
method: thrift
schema: [database/schema name]
host: [hostname]

# optional
port: [port] # default 10001
user: [user]
auth: [e.g. KERBEROS]
kerberos_service_name: [e.g. hive]
use_ssl: [true|false] # value of hive.server2.use.SSL, default false
server_side_parameters:
"spark.driver.memory": "4g"

HTTP

Use the http method if your Spark provider supports generic connections over HTTP (e.g. Databricks interactive cluster).

~/.dbt/profiles.yml
your_profile_name:
target: dev
outputs:
dev:
type: spark
method: http
schema: [database/schema name]
host: [yourorg.sparkhost.com]
organization: [org id] # Azure Databricks only
token: [abc123]
cluster: [cluster id]

# optional
port: [port] # default: 443
user: [user]
connect_timeout: 60 # default 10
connect_retries: 5 # default 0
server_side_parameters:
"spark.driver.memory": "4g"

Databricks interactive clusters can take several minutes to start up. You may include the optional profile configs connect_timeout and connect_retries, and dbt will periodically retry the connection.

Session

Use the session method if you want to run dbt against a pySpark session.

~/.dbt/profiles.yml
your_profile_name:
target: dev
outputs:
dev:
type: spark
method: session
schema: [database/schema name]
host: NA # not used, but required by `dbt-core`
server_side_parameters:
"spark.driver.memory": "4g"

Optional configurations

Retries

Intermittent errors can crop up unexpectedly while running queries against Apache Spark. If retry_all is enabled, dbt-spark will naively retry any query that fails, based on the configuration supplied by connect_timeout and connect_retries. It does not attempt to determine if the query failure was transient or likely to succeed on retry. This configuration is recommended in production environments, where queries ought to be succeeding.

For instance, this will instruct dbt to retry all failed queries up to 3 times, with a 5 second delay between each retry:

~/.dbt/profiles.yml
retry_all: true
connect_timeout: 5
connect_retries: 3

Server side configuration

Spark can be customized using Application Properties. Using these properties the execution can be customized, for example, to allocate more memory to the driver process. Also, the Spark SQL runtime can be set through these properties. For example, this allows the user to set a Spark catalogs.

Caveats

When facing difficulties, run poetry run dbt debug --log-level=debug. The logs are saved at logs/dbt.log.

Usage with EMR

To connect to Apache Spark running on an Amazon EMR cluster, you will need to run sudo /usr/lib/spark/sbin/start-thriftserver.sh on the master node of the cluster to start the Thrift server (see the docs for more information). You will also need to connect to port 10001, which will connect to the Spark backend Thrift server; port 10000 will instead connect to a Hive backend, which will not work correctly with dbt.

Supported functionality

Most dbt Core functionality is supported, but some features are only available on Delta Lake (Databricks).

Delta-only features:

  1. Incremental model updates by unique_key instead of partition_by (see merge strategy)
  2. Snapshots
  3. Persisting column-level descriptions as database comments

Default namespace with Thrift connection method

To run metadata queries in dbt, you need to have a namespace named default in Spark when connecting with Thrift. You can check available namespaces by using Spark's pyspark and running spark.sql("SHOW NAMESPACES").show(). If the default namespace doesn't exist, create it by running spark.sql("CREATE NAMESPACE default").show().

If there's a network connection issue, your logs will display an error like Could not connect to any of [('127.0.0.1', 10000)] (or something similar).

0