
Using Magic Commands in Glue PySpark Notebooks

This guide explains how to use the %extra_jars magic command correctly in Glue PySpark notebooks, in both SageMaker Studio and SageMaker Notebooks, to load additional JAR files for JDBC drivers and custom classes.

Issue: JAR Loading in Glue Interactive Sessions

When working with Glue Interactive Sessions (Studio), you may encounter errors when trying to load JAR files that work fine in AWS Glue Jobs. This happens because of fundamental differences in how Spark sessions are initialized between Glue Jobs and Glue Interactive Sessions.

Root Cause

In Glue Jobs:

  • JARs can be attached before the job starts using:
    • The "Dependent JARs" field in the AWS Glue Console, or
    • The --extra-jars argument when defining the job (see the boto3 sketch after this list)
  • These JARs are included in the classpath during JVM startup
  • This is when Spark loads JDBC drivers and other custom classes
  • Drivers like com.example.jdbc.CustomDriver work without issues
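
For reference, the sketch below shows one way to attach a dependent JAR at job-definition time with boto3's create_job call; the job name, role ARN, script location, and JAR path are placeholders, not values from this guide.

import boto3

glue = boto3.client("glue")

# Minimal sketch: attach a dependent JAR via the --extra-jars default argument.
# All names and paths below are placeholders.
glue.create_job(
    Name="example-jdbc-job",
    Role="arn:aws:iam::123456789012:role/ExampleGlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--extra-jars": "s3://your-bucket/path/to/your.jar",
    },
    GlueVersion="4.0",
)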

In Glue Interactive Sessions:

  • The Spark kernel and session are already initialized before your code runs
  • When you write code like:
    spark = SparkSession.builder.config("spark.jars", "...").getOrCreate()
  • You're trying to modify Spark config after the session has been created
  • This does not affect the current JVM or the executor classpath
  • Any JARs you attempt to attach this way are not actually loaded
  • JDBC drivers are not found, causing errors (illustrated in the snippet below)
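
To illustrate, the following pattern silently fails in an Interactive Session: getOrCreate() returns the session that already exists, so the new spark.jars value never reaches the running JVM (the S3 path is a placeholder).

from pyspark.sql import SparkSession

# Anti-pattern in an Interactive Session: the session already exists,
# so getOrCreate() returns it and the spark.jars setting is ignored.
spark = SparkSession.builder \
    .config("spark.jars", "s3://your-bucket/path/to/your.jar") \
    .getOrCreate()

# The JAR is not on the classpath, so the driver class still cannot be found.
print(spark.sparkContext._jsc.sc().listJars())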

Solution: Using the %extra_jars Magic Command

The solution is to use the %extra_jars magic command before running any Python code that creates a Spark session.

Step-by-Step Process

  1. Open your Jupyter notebook with Glue PySpark kernel

  2. Configure JAR files using magic command first:

    %extra_jars "s3://your-bucket/path/to/your.jar"
  3. Run your Python code after the magic command:

    import boto3
    from pyspark.sql import SparkSession

    # Verify JARs are loaded
    print(spark.sparkContext._jsc.sc().listJars())

    # Test driver class loading
    try:
        spark._jvm.java.lang.Class.forName("com.example.jdbc.CustomDriver")
        print("Driver class loaded successfully.")
    except Exception as e:
        print("Failed to load driver class:", e)

Complete Example: Loading Custom JAR Files

Here's a complete example showing how to load custom JAR files using the %extra_jars magic command:

# Step 1: Configure JAR file using magic command
%extra_jars "s3://<amorphic-etl-bucket>/common-libs/<library-id>/libs/java/custom-driver.jar"

# Step 2: Import required libraries
import boto3
from pyspark.sql import SparkSession

# Step 3: Verify JARs are loaded
print("Available JARs in classpath:")
print(spark.sparkContext._jsc.sc().listJars())

# Step 4: Test driver class loading
try:
    spark._jvm.java.lang.Class.forName("com.example.jdbc.CustomDriver")
    print("Driver class loaded successfully.")
except Exception as e:
    print("Failed to load driver class:", e)

# Step 5: Example of using custom functionality
try:
    # Example: Create a custom object using the loaded JAR
    custom_object = spark._jvm.com.example.CustomClass()
    print("Custom object created successfully using loaded JAR.")

    # Example: Access custom methods
    result = custom_object.someMethod()
    print(f"Custom method executed successfully. Result: {result}")

except Exception as e:
    print("Failed to use custom functionality:", e)

# Step 6: Example: Working with custom data types
try:
    # Example: Using custom data types from the JAR
    custom_data_type = spark._jvm.com.example.CustomDataType()
    print("Custom data type created successfully.")

except Exception as e:
    print("Failed to create custom data type:", e)

Important Notes

Session Initialization Order

Understanding the session initialization process is crucial:

  1. By default, after starting a Jupyter Spark notebook, running any Python cell creates a new Spark session with default settings and no extra JARs (even a simple import creates a new session)

  2. Magic commands (%) run without starting a Spark/Glue session; they only set the configuration and parameters that the session will use when it is created

  3. Configuration sequence:

    • First: Configure all settings or JARs using magic commands without running any Python code
    • Then: Run your Python code - it will pick up the changes from the previous step
  4. If you run Python code first, then use magic commands, those changes won't be reflected in the current session. You'll need to stop and restart the session to reflect new changes.
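
To make the ordering concrete, here is a sketch of the cell layout that works (the S3 path is a placeholder):

# Cell 1: configure first, before any Python code runs
%extra_jars "s3://your-bucket/path/to/your.jar"

# Cell 2: now run Python; the session created here picks up the JAR
import boto3
print(spark.sparkContext._jsc.sc().listJars())

# If a Python cell had already run before the %extra_jars magic, the session
# would already exist and the JAR would not load until the session is restarted.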

Best Practices

  • Always use magic commands before running any Python code that creates or uses a Spark session
  • Verify JAR loading by checking spark.sparkContext._jsc.sc().listJars()
  • Test driver class loading before attempting database connections
  • If you need to add JARs after running code, restart the session (see the example below)
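
For example, you can stop the current session and reconfigure before running any further Python code. The %stop_session magic is available in Glue Interactive Sessions (if your environment does not support it, restart the kernel instead); the S3 path is a placeholder.

# Cell 1: stop the active session
%stop_session

# Cell 2: in the fresh session, configure JARs again before any Python code
%extra_jars "s3://your-bucket/path/to/your.jar"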

Multiple JAR Files

You can specify multiple JAR files using comma separation:

%extra_jars "s3://<amorphic-etl-bucket>/common-libs/<library-id1>/libs/java/jar1.jar,s3://<amorphic-etl-bucket>/common-libs/<library-id2>/libs/java/jar2.jar"

Supported Environments

This approach works in both:

  • SageMaker Studio with Glue PySpark kernel
  • SageMaker Notebooks with Interactive Sessions enabled

Troubleshooting

If you encounter issues:

  1. Verify JAR path: Ensure the S3 path is correct and accessible
  2. Check session state: Make sure you're not in an active Spark session when using magic commands
  3. Restart session: If changes don't take effect, restart the kernel/session
  4. Verify permissions: Ensure your notebook has access to the S3 bucket containing the JAR files
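
As a quick check of both the JAR path and permissions, you can run a boto3 head_object call from the notebook; the bucket and key below are placeholders. Note that running Python code starts the Spark session, so do this after your magic commands (or restart the session afterwards).

import boto3

# Placeholder bucket/key: confirm the JAR exists and the session role can read it.
s3 = boto3.client("s3")
try:
    s3.head_object(Bucket="your-bucket", Key="path/to/your.jar")
    print("JAR object is reachable.")
except Exception as e:
    print("Cannot access the JAR in S3:", e)
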
Note

The %extra_jars magic command is specific to Glue Interactive Sessions and is not available in regular Spark notebooks or Glue Jobs.