VS Code Extension · v0.8.1

Catch Spark performance
issues before production

CatalystOps analyzes your PySpark and Databricks code inline as you type — 30+ anti-pattern detectors, dry-run plan analysis on your cluster, and actionable fixes. No context switching.

pipeline.py
1 from pyspark.sql import functions as F
2
3 events = spark.read.parquet("s3://bucket/events/")
4 daily = events.groupBy("date").count() ← count() triggers full scan
5 hourly = events.collect() ← OOM risk on large datasets
6
7 result = daily.join(F.broadcast(small_df), "date")
CODE_COUNT_001 · Warning
count() > 0 triggers a full Spark job just to check emptiness. Use df.isEmpty() instead — it short-circuits on the first partition.
Quick fix
if not df.isEmpty(): ...
30+
Anti-pattern rules
0
Clusters needed for local analysis
2
Execution modes — cluster & serverless
Free
MIT licensed
Features

Everything in one place

From static code analysis to live Databricks plan inspection — without leaving the editor.

Local Static Analysis

Detects 30+ PySpark and Databricks anti-patterns inline as you type. No cluster required. Catches collect(), UDFs, cross joins, unsafe writes, SQL injection, schema drift, and more.

Dry-Run Plan Analysis

Submits a neutralized version of your script to Databricks (cluster or Serverless) and returns the physical Catalyst plan — with sort-merge join detection, broadcast thresholds, shuffle analysis, and cost estimation.

Explain Plan Tree & DAG

Interactive sidebar tree of the physical plan with per-node cost scores. One-click DAG webview. Context-aware quick fixes directly on plan nodes — broadcast hint, repartition, persist, AQE config.

Billing Dashboard

Tracks DBU and dollar spend per period directly from system.billing.usage with a 1-hour cache. After each serverless run, optionally fetches actual DBU consumption.

Schema Validation

Tracks inferred schemas across DataFrames. Validates join column existence and type compatibility. Detects union column-order mismatches that silently corrupt data at runtime.

MCP Server

Exposes a Streamable HTTP MCP server auto-discovered by VS Code 1.99+. Lets Claude and other AI tools analyze your PySpark code, fetch billing summaries, and run dry runs through natural language.

How it works

From install to insight in minutes

1

Install & open a Python file

Install from the VS Code Marketplace. The moment you open a .py file, local analysis kicks in — no configuration needed. 30+ rules light up immediately for any PySpark anti-patterns.

2

Connect to Databricks (optional)

Add your workspace URL and personal access token via CatalystOps: Configure Databricks Connection. Pick cluster or Serverless execution mode. CatalystOps reads your ~/.databrickscfg automatically if it exists.

3

Run a dry-run analysis

Press ⌘⇧K to submit the current file. CatalystOps neutralizes side-effects, executes the Catalyst planner on your cluster, and returns the physical plan with cost annotations, join strategies, and actionable fixes — all in the sidebar.

Detection rules

What CatalystOps catches

Rules span Spark actions, joins, streaming, Delta Lake, DLT pipelines, and security.

Critical / Data correctness
  • SQL injection via f-string in spark.sql()
  • Kafka auto-commit enabled
  • APPLY AS DELETE WHEN wrong clause order (CDC)
  • union() / intersect() column order mismatch
  • Streaming inner join silently drops events
  • FLOAT / DOUBLE for financial columns
Performance
  • collect() / toPandas() on large DataFrames
  • Cross join / cartesian product
  • Global orderBy (full shuffle)
  • withColumn() inside a loop
  • Window.orderBy() without partitionBy()
  • Repeated source scan without cache
Streaming
  • No .trigger() — continuous micro-batches
  • groupBy() without watermark (unbounded state)
  • Dynamic allocation on streaming cluster
  • Stateful streaming without RocksDB
  • Auto Loader without maxBytesPerTrigger
  • Checkpoint stored on DBFS
Delta Lake & DLT
  • MERGE without Deletion Vectors
  • MERGE without Row-Level Concurrency
  • OPTIMIZE after every MERGE (latency spikes)
  • DROP TABLE + CREATE TABLE (non-atomic)
  • DLT PARTITION BY instead of Liquid Clustering
  • read_files() without schemaHints
MCP Integration

Talk to your Spark pipeline with Claude

CatalystOps ships a built-in MCP server auto-discovered by VS Code 1.99+. Claude and other AI clients can call CatalystOps tools directly — analyze code, fetch billing data, run dry-runs, and read plan results through natural language.

  • analyze_pyspark — run local analysis on any code snippet
  • run_dry_run — trigger a Databricks dry-run from chat
  • get_billing_summary — ask about DBU spend in plain English
  • get_plan_analysis — read last dry-run plan results from Claude
Claude chat
You
What issues are in my active file?
Claude
↳ calling get_active_file_issues…
Found 3 issues in pipeline.py:

Line 12 — collect() pulls all data to the driver. OOM risk on large datasets.
Line 18 — Global orderBy shuffles all data to one partition.
Line 31 — Streaming query has no .trigger() — continuous micro-batches.
You
Fix the streaming issue and run a dry run
Claude
↳ calling run_dry_run…
Dry run complete. The plan shows a Sort-Merge Join on line 8 — one side is small enough to broadcast. Want me to add a broadcast hint?

Start optimizing today

Free, open-source, and available for any Databricks or PySpark project.

Also available on Open VSX for Cursor, Theia, and other editors.