
Using Magic Commands in Glue PySpark Notebooks

This guide explains how to use the %extra_jars magic command correctly in Glue PySpark notebooks, in both SageMaker Studio and SageMaker Notebooks, to load additional JAR files for JDBC drivers and custom classes.

Issue: JAR Loading in Glue Interactive Sessions

When working with Glue Interactive Sessions (Studio), you may encounter errors when trying to load JAR files that work fine in AWS Glue Jobs. This happens because of fundamental differences in how Spark sessions are initialized between Glue Jobs and Glue Interactive Sessions.

Root Cause

In Glue Jobs:

  • JARs can be attached before the job starts using:
    • The "Dependent JARs" field in the AWS Glue Console, or
    • The --extra-jars argument when defining the job (see the boto3 sketch after this list)
  • These JARs are included in the classpath during JVM startup
  • This is when Spark loads JDBC drivers and other custom classes
  • Drivers like com.example.jdbc.CustomDriver work without issues
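
For reference, the sketch below shows one way to attach a dependent JAR at job-definition time with boto3's create_job call; the job name, role ARN, script location, and JAR path are placeholders, not values from this guide.

import boto3

glue = boto3.client("glue")

# Minimal sketch: attach a dependent JAR via the --extra-jars default argument.
# All names and paths below are placeholders.
glue.create_job(
    Name="example-jdbc-job",
    Role="arn:aws:iam::123456789012:role/ExampleGlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--extra-jars": "s3://your-bucket/path/to/your.jar",
    },
    GlueVersion="4.0",
)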

In Glue Interactive Sessions:

  • The Spark kernel and session are already initialized before your code runs
  • When you write code like:
    spark = SparkSession.builder.config("spark.jars", "...").getOrCreate()
  • You're trying to modify Spark config after the session has been created
  • This does not affect the current JVM or the executor classpath
  • Any JARs you attempt to attach this way are not actually loaded
  • JDBC drivers are not found, causing errors (illustrated in the snippet below)
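
To illustrate, the following pattern silently fails in an Interactive Session: getOrCreate() returns the session that already exists, so the new spark.jars value never reaches the running JVM (the S3 path is a placeholder).

from pyspark.sql import SparkSession

# Anti-pattern in an Interactive Session: the session already exists,
# so getOrCreate() returns it and the spark.jars setting is ignored.
spark = SparkSession.builder \
    .config("spark.jars", "s3://your-bucket/path/to/your.jar") \
    .getOrCreate()

# The JAR is not on the classpath, so the driver class still cannot be found.
print(spark.sparkContext._jsc.sc().listJars())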

Solution: Using the %extra_jars Magic Command

The solution is to use the %extra_jars magic command before running any Python code that creates a Spark session.

Step-by-Step Process

  1. Open your Jupyter notebook with Glue PySpark kernel

  2. Configure JAR files using magic command first:

    %extra_jars "s3://your-bucket/path/to/your.jar"
  3. Run your Python code after the magic command:

    import boto3
    from pyspark.sql import SparkSession

    # Verify JARs are loaded
    print(spark.sparkContext._jsc.sc().listJars())

    # Test driver class loading
    try:
        spark._jvm.java.lang.Class.forName("com.example.jdbc.CustomDriver")
        print("Driver class loaded successfully.")
    except Exception as e:
        print("Failed to load driver class:", e)

Complete Example: Loading Custom JAR Files

Here's a complete example showing how to load custom JAR files using the %extra_jars magic command:

# Step 1: Configure JAR file using magic command
%extra_jars "s3://<amorphic-etl-bucket>/common-libs/<library-id>/libs/java/custom-driver.jar"

# Step 2: Import required libraries
import boto3
from pyspark.sql import SparkSession

# Step 3: Verify JARs are loaded
print("Available JARs in classpath:")
print(spark.sparkContext._jsc.sc().listJars())

# Step 4: Test driver class loading
try:
    spark._jvm.java.lang.Class.forName("com.example.jdbc.CustomDriver")
    print("Driver class loaded successfully.")
except Exception as e:
    print("Failed to load driver class:", e)

# Step 5: Example of using custom functionality
try:
    # Example: Create a custom object using the loaded JAR
    custom_object = spark._jvm.com.example.CustomClass()
    print("Custom object created successfully using loaded JAR.")

    # Example: Access custom methods
    result = custom_object.someMethod()
    print(f"Custom method executed successfully. Result: {result}")

except Exception as e:
    print("Failed to use custom functionality:", e)

# Step 6: Example: Working with custom data types
try:
    # Example: Using custom data types from the JAR
    custom_data_type = spark._jvm.com.example.CustomDataType()
    print("Custom data type created successfully.")

except Exception as e:
    print("Failed to create custom data type:", e)

Important Notes

Session Initialization Order

Understanding the session initialization process is crucial:

  1. By default, after starting a Jupyter Spark notebook, running any Python cell creates a new Spark session with default settings and no extra JARs (even a simple import creates a new session)

  2. Magic commands (%) run without starting a Spark/Glue session; they only set the configuration and parameters that the session will use when it is created

  3. Configuration sequence:

    • First: Configure all settings or JARs using magic commands without running any Python code
    • Then: Run your Python code - it will pick up the changes from the previous step
  4. If you run Python code first, then use magic commands, those changes won't be reflected in the current session. You'll need to stop and restart the session to reflect new changes.
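
To make the ordering concrete, here is a sketch of the cell layout that works (the S3 path is a placeholder):

# Cell 1: configure first, before any Python code runs
%extra_jars "s3://your-bucket/path/to/your.jar"

# Cell 2: now run Python; the session created here picks up the JAR
import boto3
print(spark.sparkContext._jsc.sc().listJars())

# If a Python cell had already run before the %extra_jars magic, the session
# would already exist and the JAR would not load until the session is restarted.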

Best Practices

  • Always use magic commands before running any Python code that creates or uses a Spark session
  • Verify JAR loading by checking spark.sparkContext._jsc.sc().listJars()
  • Test driver class loading before attempting database connections
  • If you need to add JARs after running code, restart the session (see the example below)
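
For example, you can stop the current session and reconfigure before running any further Python code. The %stop_session magic is available in Glue Interactive Sessions (if your environment does not support it, restart the kernel instead); the S3 path is a placeholder.

# Cell 1: stop the active session
%stop_session

# Cell 2: in the fresh session, configure JARs again before any Python code
%extra_jars "s3://your-bucket/path/to/your.jar"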

Multiple JAR Files

You can specify multiple JAR files using comma separation:

%extra_jars "s3://<amorphic-etl-bucket>/common-libs/<library-id1>/libs/java/jar1.jar,s3://<amorphic-etl-bucket>/common-libs/<library-id2>/libs/java/jar2.jar"

Supported Environments

This approach works in both:

  • SageMaker Studio with Glue PySpark kernel
  • SageMaker Notebooks with Interactive Sessions enabled

Troubleshooting

If you encounter issues:

  1. Verify JAR path: Ensure the S3 path is correct and accessible
  2. Check session state: Make sure you're not in an active Spark session when using magic commands
  3. Restart session: If changes don't take effect, restart the kernel/session
  4. Verify permissions: Ensure your notebook has access to the S3 bucket containing the JAR files
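
As a quick check of both the JAR path and permissions, you can run a boto3 head_object call from the notebook; the bucket and key below are placeholders. Note that running Python code starts the Spark session, so do this after your magic commands (or restart the session afterwards).

import boto3

# Placeholder bucket/key: confirm the JAR exists and the session role can read it.
s3 = boto3.client("s3")
try:
    s3.head_object(Bucket="your-bucket", Key="path/to/your.jar")
    print("JAR object is reachable.")
except Exception as e:
    print("Cannot access the JAR in S3:", e)
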
Note

The %extra_jars magic command is specific to Glue Interactive Sessions and is not available in regular Spark notebooks or Glue Jobs.