VS Code Extension · v0.8.1

Catch Spark performance
issues before production

CatalystOps analyzes your PySpark and Databricks code inline as you type — 30+ anti-pattern detectors, dry-run plan analysis on your cluster, and actionable fixes. No context switching.

pipeline.py
1 from pyspark.sql import functions as F
2
3 events = spark.read.parquet("s3://bucket/events/")
4 daily = events.groupBy("date").count() ← count() triggers full scan
5 hourly = events.collect() ← OOM risk on large datasets
6
7 result = daily.join(F.broadcast(small_df), "date")
CODE_COUNT_001 · Warning
count() > 0 triggers a full Spark job just to check emptiness. Use df.isEmpty() instead — it short-circuits on the first partition.
Quick fix
if not df.isEmpty(): ...
30+
Anti-pattern rules
0
Clusters needed for local analysis
2
Execution modes — cluster & serverless
Free
MIT licensed
Features

Everything in one place

From static code analysis to live Databricks plan inspection — without leaving the editor.

Local Static Analysis

Detects 30+ PySpark and Databricks anti-patterns inline as you type. No cluster required. Catches collect(), UDFs, cross joins, unsafe writes, SQL injection, schema drift, and more.

Dry-Run Plan Analysis

Submits a neutralized version of your script to Databricks (cluster or Serverless) and returns the physical Catalyst plan — with sort-merge join detection, broadcast thresholds, shuffle analysis, and cost estimation.

Explain Plan Tree & DAG

Interactive sidebar tree of the physical plan with per-node cost scores. One-click DAG webview. Context-aware quick fixes directly on plan nodes — broadcast hint, repartition, persist, AQE config.

Billing Dashboard

Tracks DBU and dollar spend per period directly from system.billing.usage with a 1-hour cache. After each serverless run, optionally fetches actual DBU consumption.

Schema Validation

Tracks inferred schemas across DataFrames. Validates join column existence and type compatibility. Detects union column-order mismatches that silently corrupt data at runtime.

MCP Server

Exposes a Streamable HTTP MCP server auto-discovered by VS Code 1.99+. Lets Claude and other AI tools analyze your PySpark code, fetch billing summaries, and run dry runs through natural language.

How it works

From install to insight in minutes

1

Install & open a Python file

Install from the VS Code Marketplace. The moment you open a .py file, local analysis kicks in — no configuration needed. 30+ rules light up immediately for any PySpark anti-patterns.

2

Connect to Databricks (optional)

Add your workspace URL and personal access token via CatalystOps: Configure Databricks Connection. Pick cluster or Serverless execution mode. CatalystOps reads your ~/.databrickscfg automatically if it exists.

3

Run a dry-run analysis

Press ⌘⇧K to submit the current file. CatalystOps neutralizes side-effects, executes the Catalyst planner on your cluster, and returns the physical plan with cost annotations, join strategies, and actionable fixes — all in the sidebar.

Detection rules

What CatalystOps catches

Rules span Spark actions, joins, streaming, Delta Lake, DLT pipelines, and security.

Critical / Data correctness
  • SQL injection via f-string in spark.sql()
  • Kafka auto-commit enabled
  • APPLY AS DELETE WHEN wrong clause order (CDC)
  • union() / intersect() column order mismatch
  • Streaming inner join silently drops events
  • FLOAT / DOUBLE for financial columns
Performance
  • collect() / toPandas() on large DataFrames
  • Cross join / cartesian product
  • Global orderBy (full shuffle)
  • withColumn() inside a loop
  • Window.orderBy() without partitionBy()
  • Repeated source scan without cache
Streaming
  • No .trigger() — continuous micro-batches
  • groupBy() without watermark (unbounded state)
  • Dynamic allocation on streaming cluster
  • Stateful streaming without RocksDB
  • Auto Loader without maxBytesPerTrigger
  • Checkpoint stored on DBFS
Delta Lake & DLT
  • MERGE without Deletion Vectors
  • MERGE without Row-Level Concurrency
  • OPTIMIZE after every MERGE (latency spikes)
  • DROP TABLE + CREATE TABLE (non-atomic)
  • DLT PARTITION BY instead of Liquid Clustering
  • read_files() without schemaHints
MCP Integration

Talk to your Spark pipeline with Claude

CatalystOps ships a built-in MCP server auto-discovered by VS Code 1.99+. Claude and other AI clients can call CatalystOps tools directly — analyze code, fetch billing data, run dry-runs, and read plan results through natural language.

  • analyze_pyspark — run local analysis on any code snippet
  • run_dry_run — trigger a Databricks dry-run from chat
  • get_billing_summary — ask about DBU spend in plain English
  • get_plan_analysis — read last dry-run plan results from Claude
Claude chat
You
What issues are in my active file?
Claude
↳ calling get_active_file_issues…
Found 3 issues in pipeline.py:

Line 12 — collect() pulls all data to the driver. OOM risk on large datasets.
Line 18 — Global orderBy shuffles all data to one partition.
Line 31 — Streaming query has no .trigger() — continuous micro-batches.
You
Fix the streaming issue and run a dry run
Claude
↳ calling run_dry_run…
Dry run complete. The plan shows a Sort-Merge Join on line 8 — one side is small enough to broadcast. Want me to add a broadcast hint?

Start optimizing today

Free, open-source, and available for any Databricks or PySpark project.

Also available on Open VSX for Cursor, Theia, and other editors.