<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://vatchechamlian.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://vatchechamlian.com/" rel="alternate" type="text/html" /><updated>2026-02-10T20:19:44+00:00</updated><id>https://vatchechamlian.com/feed.xml</id><title type="html">Vatché Chamlian</title><subtitle>Thinker, Tinker, AI Builder</subtitle><author><name>Vatché</name></author><entry><title type="html">Your RAG Pipeline is Lying to You: Understanding the RAG Triad</title><link href="https://vatchechamlian.com/rag-triad-evaluation.html" rel="alternate" type="text/html" title="Your RAG Pipeline is Lying to You: Understanding the RAG Triad" /><published>2026-02-10T00:00:00+00:00</published><updated>2026-02-10T00:00:00+00:00</updated><id>https://vatchechamlian.com/rag-triad-evaluation</id><content type="html" xml:base="https://vatchechamlian.com/rag-triad-evaluation.html"><![CDATA[<p>About two and a half years ago I wrote an article called <a href="./ai-gets-cheat-sheet-retrieval-augmented-generation-explained.html">AI Gets a Cheat Sheet: Retrieval Augmented Generation Explained</a> where I broke down what RAG is, how retrieval technologies work, and why getting your data prepped matters before you can even think about plugging this stuff in. That article was focused on the <em>what</em> and the <em>why</em> of RAG, and at the time the ecosystem was still pretty early. People were just starting to wrap their heads around vector databases, semantic search, and how all of it tied back to foundation models.</p>

<p>A lot has changed since then. RAG has gone from being a novel technique that most people hadn’t heard of to something that is now embedded (pun intended) in production systems across industries. Companies have built RAG-powered chatbots, internal knowledge assistants, customer support systems, and document retrieval tools. The problem is that many of these systems were stood up quickly, and the teams running them don’t have a reliable way to know when things are going wrong, or more importantly, <em>why</em> things are going wrong.</p>

<p>That’s where the RAG Triad comes in.</p>

<p>Before we get into the framework itself, I want to make sure the terminology is clear since a few of these terms are going to come up repeatedly throughout the article. When we talk about <strong>retrieval</strong> in the context of RAG, we’re referring to the step where the system searches through your documents, knowledge base, or other data sources to find the most relevant pieces of information based on a user’s question. Think of it as the system going to look something up before it tries to answer. <strong>Generation</strong> is the step where a large language model (often referred to as an LLM, which is the AI model doing the heavy lifting, like GPT or Claude) takes whatever was retrieved and uses it to construct a response in natural language. The retrieval step finds the information, and the generation step turns that information into an answer. Everything we’re going to talk about below evaluates how well those two steps are working together, and where things tend to fall apart.</p>

<h2 id="what-is-the-rag-triad">What is the RAG Triad?</h2>

<p>The RAG Triad is an evaluation framework that breaks down the quality of a RAG system into three distinct metrics: <strong>Context Relevance</strong>, <strong>Faithfulness</strong> (sometimes called Groundedness), and <strong>Answer Relevance</strong>. Each metric evaluates a different stage of the pipeline, and each one can fail independently of the others. This is important because when your RAG system gives a bad answer, the root cause could be in your retrieval step, your generation step, or both, and the fix is completely different depending on which one is actually broken.</p>

<p>I’ve said this before in the context of security and I’ll say it again here: you can’t fix what you can’t measure. If you’re tweaking prompts and hoping the output gets better without actually measuring which part of the system is failing, you’re essentially guessing. And guessing at scale, especially when these systems are customer-facing, is a great way to erode trust fast.</p>

<h2 id="the-three-legs-and-how-each-one-breaks">The Three Legs (and How Each One Breaks)</h2>

<h3 id="context-relevance">Context Relevance</h3>

<p>Context relevance asks a simple question: did your retrieval step actually pull back the right documents for the user’s query? This is the foundation of everything, because if the wrong context gets fed into the LLM, it doesn’t matter how good your model is or how tight your prompt engineering is. The answer will be wrong from the start.</p>

<p>Here’s an example that illustrates the problem well. A user asks “What’s our refund policy for enterprise customers?” and the system retrieves your general FAQ page instead of the enterprise contract terms. The model generates a response that sounds confident and well-structured, but it’s based on the wrong source material. The user might not even realize the answer is incorrect, and that’s the dangerous part. The system didn’t hallucinate in the traditional sense; it faithfully summarized the wrong documents.</p>

<p>When context relevance scores are low, the problem lives in your retrieval pipeline, and the people who need to be looking at this are typically your data engineers or ML engineers, whoever built and maintains the system that decides which documents get pulled for a given query.</p>

<p>The things they’d look at start with the <strong>chunking strategy</strong>. Chunking is the process of breaking your documents into smaller pieces before they get stored in the system. When a user asks a question, the system doesn’t search through entire documents; it searches through these smaller chunks and returns the ones that seem most relevant. If your chunks are too large, the system might return a big block of text where only one sentence is actually relevant, which dilutes the quality of what the LLM has to work with. If your chunks are too small, you might lose important context because a key piece of information got split across two separate chunks that the system doesn’t know belong together.</p>

<p>They’d also look at the <strong>embedding model</strong>, which is the component that converts both your document chunks and the user’s query into numerical representations (called vectors) so the system can measure how similar they are. If your embedding model wasn’t trained on content that resembles your domain-specific data, it might not understand the relationships between terms that matter in your industry, and the similarity scores it produces won’t reflect actual relevance.</p>

<p>Another common fix is adding a <strong>re-ranker</strong> between your initial search results and the LLM. The way this works is that your vector search might return, say, the top 20 chunks that seem relevant based on similarity scores, but similarity doesn’t always equal relevance. A re-ranker (tools like Cohere Rerank are popular for this) takes those initial results and rescores them using a more sophisticated model that’s specifically designed to evaluate whether a piece of text actually answers a given question. The result is that the chunks that make it to the LLM are much more likely to be the right ones.</p>

<p>You might also look at <strong>query rewriting</strong>, which is where you transform the user’s natural language question into something that maps better to how your documents are structured. A user might ask a casual question in a way that doesn’t match the terminology used in your knowledge base, and rewriting the query before it hits the search step can dramatically improve what comes back.</p>

<h3 id="faithfulness-groundedness">Faithfulness (Groundedness)</h3>

<p>Faithfulness measures whether the LLM stuck to what the retrieved context actually says, or whether it started making things up. This is the hallucination metric, and it’s probably the one that keeps most teams up at night because the outputs can look incredibly convincing even when they’re fabricated.</p>

<p>The retrieved documents might be exactly right, but the model adds details, invents statistics, or draws conclusions that aren’t supported by anything in the context. For instance, the context might say “revenue grew in Q3” and the model outputs “revenue grew 15% in Q3” when no percentage was mentioned anywhere in the source material. That 15% sounds specific and credible, which makes it more dangerous than an obviously wrong answer would be.</p>

<p>When faithfulness scores are low, the problem is in your generation step, not your retrieval. This is where the person managing the LLM configuration and the <strong>system prompt</strong> (the set of instructions that tell the model how to behave, what to prioritize, and what constraints to follow) needs to step in. In many teams this is a developer or an AI/ML engineer, but increasingly it’s also someone in a prompt engineering or AI product role.</p>

<p>The fixes here look different from the retrieval side. You’d start by tightening the system prompt to explicitly instruct the model to only use the provided context when generating its response, and to say “I don’t know” or “I don’t have enough information to answer that” when the context doesn’t contain what’s needed. You’d also want to actually test whether the model follows through on that instruction, because telling it to do something and having it consistently do it are two different things.</p>

<p>Another lever is <strong>temperature</strong>, which is a setting that controls how much creative freedom the LLM has when generating text. A higher temperature means the model is more likely to introduce variety and take liberties with its phrasing, which is great for creative writing but not what you want when the goal is to faithfully represent source material. Lowering the temperature makes the model more conservative and more likely to stick closely to what it was given.</p>

<p>You can also improve faithfulness by filtering out low-relevance chunks before they even reach the LLM. If the model receives five chunks but only two of them are actually relevant, the other three become noise that the model might draw from when constructing its answer, increasing the chance it says something that isn’t grounded in the right information.</p>

<h3 id="answer-relevance">Answer Relevance</h3>

<p>Answer relevance is the one that tends to sneak past teams because the system can retrieve decent context, not hallucinate at all, and still completely miss the point of what the user was asking. Someone asks “How do I cancel my subscription?” and gets a technically accurate paragraph about subscription pricing tiers. The information is correct, the model didn’t make anything up, but it didn’t answer the actual question.</p>

<p>This is often a query understanding problem, and figuring out who owns the fix can be tricky because it could be a retrieval issue, a generation issue, or both. If the retrieval step returned documents about pricing instead of cancellation, that’s a retrieval problem and falls back to the data/ML engineering team. But if the retrieval step returned the right documents and the model just chose to summarize the wrong part, that’s a generation and prompt engineering problem.</p>

<p>The system misinterprets what the user is actually asking for, or the prompt doesn’t sufficiently guide the model to focus on answering the specific question rather than just summarizing whatever relevant-looking context it received. <strong>Query classification</strong> and <strong>intent detection</strong> (techniques that try to understand what the user is really asking before the search even happens) can help on the retrieval side, while better prompt engineering can address the generation side. Sometimes the fix is as straightforward as restructuring your system prompt to tell the model “answer the user’s specific question, don’t just summarize the context.”</p>

<h2 id="how-you-actually-evaluate-this">How You Actually Evaluate This</h2>

<p>The most common approach today is using an LLM as a judge. You take a separate LLM call (not the same one generating your answers) and use it to evaluate the output of your RAG pipeline. For each query, you capture three things: the original question, the retrieved context chunks, and the generated answer. Then you send those to the evaluator model with specific prompts designed for each metric.</p>

<p>For context relevance, you’d ask something like “Given this user question, rate how relevant each retrieved chunk is on a scale of 0 to 1.” For faithfulness, you’d ask “Can every claim in this answer be directly traced back to the retrieved context?” For answer relevance, you’d ask “Does this answer actually address what the user was asking?”</p>

<p>Tools like <a href="https://docs.ragas.io/">RAGAS</a>, <a href="https://docs.confident-ai.com/">DeepEval</a>, and <a href="https://www.trulens.org/">TruLens</a> have these evaluator prompts built in so you don’t have to write them from scratch. You run them against a dataset of real or synthetic queries and get scores for each dimension. This is not a one-time exercise either; you want to be running these evaluations regularly, especially as your document corpus changes, as your chunking strategy evolves, or as you swap out models.</p>

<p>You can also layer in human evaluation where actual people review a sample of outputs. This is more expensive and slower, but it catches things that LLM judges miss. Most mature teams that I’ve talked to do both: automated eval for coverage and speed, human eval for calibration and edge cases.</p>

<h2 id="the-key-insight">The Key Insight</h2>

<p>The thing that I keep coming back to is that the RAG Triad gives you a diagnostic framework, not just a quality score. When something goes wrong (and it will), you can pinpoint whether the problem is in retrieval, generation, or query understanding, and then target your fix accordingly. This is fundamentally different from the approach I see a lot of teams taking, which is essentially just tweaking prompts randomly and hoping the output looks better on the next batch of test queries.</p>

<p>It reminds me a lot of what I’ve seen in security over the years. You can’t just throw tools at a problem and hope it goes away. You need to understand the system as a whole, identify where the actual vulnerability is, and then apply the right remediation. The RAG Triad is that diagnostic layer for your AI pipeline.</p>

<p>If you’re building or maintaining RAG systems and haven’t started thinking about structured evaluation yet, I’d strongly recommend looking into RAGAS or DeepEval as a starting point. The setup time is minimal compared to the visibility you get, and once you can see which metric is failing, the path forward becomes a lot clearer.</p>

<p>As always, let me know what you think. If you’re evaluating RAG systems in production, I’d love to hear what’s working and what’s not. You can find me on <a href="https://www.linkedin.com/in/chamlian/">LinkedIn</a>.</p>]]></content><author><name>Vatché</name></author><category term="AI" /><category term="Data Science &amp; Analytics" /><category term="Artificial Intelligence" /><category term="Data Science &amp; Analytics" /><category term="RAG" /><category term="Technical Tutorials" /><summary type="html"><![CDATA[A practical guide to the RAG Triad evaluation framework, breaking down Context Relevance, Faithfulness, and Answer Relevance as diagnostic metrics for identifying and fixing failures in retrieval-augmented generation pipelines.]]></summary></entry><entry><title type="html">Building a Cloud-Connected Pwnagotchi: From Raspberry Pi to AWS Lambda</title><link href="https://vatchechamlian.com/pwnagotchi-build.html" rel="alternate" type="text/html" title="Building a Cloud-Connected Pwnagotchi: From Raspberry Pi to AWS Lambda" /><published>2026-01-28T00:00:00+00:00</published><updated>2026-01-28T00:00:00+00:00</updated><id>https://vatchechamlian.com/pwnagotchi-build</id><content type="html" xml:base="https://vatchechamlian.com/pwnagotchi-build.html"><![CDATA[<p>When you watch movies about hacking, the hack is always seamless, things just work,those of us that have worked in this space know that is not the case. There are endless variables that can go wrong and cause the most simple of tasks to become a frustrating ordeal and it is not until you have done something a bunch of times that you understand the nuances of how to proceed. When I discovered the pwnagotchi—an AI-powered WiFi security tool that runs on a Raspberry Pi—I knew I had to build one. After using it for a few months I was frustrated with the process of having to manually download the pcap files and run hashcat myself. For a while I thought about how I wanted to take it further, I wanted it to communicate with me via my phone (like the Flipper Zero). I also wanted to be able to crack the handshakes automatically, but I did not want to have to deal with the effort of setting up and managing a cracking rig.</p>

<p>What if every WiFi handshake my pwnagotchi captured could automatically upload to the cloud? What if I could get real-time notifications on my phone and decide which networks to crack with a simple Telegram command? What if the cracking happened on AWS spot instances instead of my subpar laptop? This would save me time and effort of going back and forth to the device, copying files, and running hashcat on a device that is not nearly as powerful as some cloud instances.</p>

<h2 id="what-is-a-pwnagotchi">What is a Pwnagotchi?</h2>

<p>If you’re not familiar, a <a href="https://pwnagotchi.ai/">pwnagotchi</a> is a “pet” that learns from WiFi networks around it. It’s built on a Raspberry Pi Zero and uses AI to optimize its handshake capture strategies. Think of it as a Tamagotchi for hackers—except instead of feeding it virtual food, you’re feeding it WiFi handshakes.</p>

<p>There are already great tutorials out there for building a basic pwnagotchi, so I won’t rehash the standard setup. I used <a href="https://github.com/jayofelony/pwnagotchi">jayofelony’s fork</a> which includes updated plugins, better hardware compatibility, and active maintenance, this made the initial setup much smoother than the original.</p>

<p>Instead, this post focuses on what I added on top: a complete cloud pipeline for automated cracking.</p>

<h2 id="the-hardware-stack">The Hardware Stack</h2>

<p>Let’s start with what I’m running:</p>

<ul>
  <li>
    <p><strong>Raspberry Pi Zero WH</strong> – The brains of the operation
<img src="/assets/img/posts/20260128/pizero-wh.jpg" alt="Raspberry Pi Zero WH" /></p>
  </li>
  <li>
    <p><strong>Waveshare 2.13” e-Paper Display (V4)</strong> – For that classic pwnagotchi face
<img src="/assets/img/posts/20260128/waveshare-hat.jpg" alt="Waveshare 2.13&quot; e-Paper Display (V4)" />
<img src="/assets/img/posts/20260128/waveshare-screen.jpg" alt="Waveshare 2.13&quot; e-Paper Display (V4)" /></p>
  </li>
  <li>
    <p><strong>PiSugar Battery Module</strong> – Older V2 model, with 5V output.
<img src="/assets/img/posts/20260128/pisugar.jpg" alt="piSugar Battery Module" /></p>
  </li>
  <li>
    <p><strong>3D Printed Case with Button</strong> – Protection and style
<img src="/assets/img/posts/20260128/PLA-printed-case.jpg" alt="PLA-printed-case" /></p>
  </li>
  <li>
    <p><strong>Bluetooth Tethering</strong> – Connected to my phone for internet access</p>
  </li>
</ul>

<p>The Waveshare display is perfect for this project. It’s low power, highly visible even in sunlight, and gives the pwnagotchi that retro digital pet aesthetic, I love the faces this thing makes.</p>

<h2 id="assembly-tips">Assembly Tips</h2>

<p>When it comes to the assembly, there are a few things that I think help. First, knowing when the pwnagotchi is booting is difficult with a case, but if you use the transparent nut/bolt that comes with the piSugar, it will magnify the green LED, which makes it much easier to see what is going on.</p>

<p><img src="/assets/img/posts/20260128/example-of-led-being-brighter.jpg" alt="piSugar LED" /></p>

<p>Second, I recommend using a 3D printed case. There are so many variations on the build that it’s best to find one that you like and print it out, I ended up going with PETG for the sake of durability, but PLA is fine for prototyping, in the images below you will see the PLA case I poorly printed during prototyping.</p>

<p>Third, ensure that you have a USB A to Micro USB with data transfer. Setting this up is infinitely easier when you can plug the Pi directly into your computer and see the console output, ssh into it, and SCP files back and forth.</p>

<h2 id="the-architecture">The Architecture</h2>

<p>Most pwnagotchi setups require you to manually SSH in, grab the PCAP files, and run hashcat yourself. There is a service that can be used to crack the pcap files, <a href="https://wpa-sec.stanev.org/">Distributed WPA PSK auditor</a>, but I wanted a fully automated pipeline and a quicker turnaround.</p>

<p>Here’s how it works:</p>

<ol>
  <li><strong>Pwnagotchi captures a WiFi handshake</strong> and saves the PCAP file</li>
  <li><strong>Custom S3 upload plugin</strong> automatically uploads the file to an S3 bucket’s <code class="language-plaintext highlighter-rouge">staging/</code> folder</li>
  <li><strong>Lambda function triggers</strong> when a new file arrives it then sends a Telegram notification</li>
  <li><strong>I receive a message</strong> on my phone with the network SSID, BSSID, and job ID</li>
  <li><strong>I reply via Telegram</strong> with <code class="language-plaintext highlighter-rouge">/approve [job-id]</code> or <code class="language-plaintext highlighter-rouge">/reject [job-id]</code> (clicking on the approve or rejected link copies it to your clipboard)</li>
  <li><strong>Another Lambda function</strong> processes my command and moves the file to <code class="language-plaintext highlighter-rouge">approved/</code> or <code class="language-plaintext highlighter-rouge">rejected/</code></li>
  <li><strong>A job launcher Lambda</strong> detects approved files and spins up an EC2 spot instance</li>
  <li><strong>The EC2 instance</strong> runs GPU-accelerated hashcat with the rockyou wordlist</li>
  <li><strong>Results are uploaded</strong> back to S3 and I get a Telegram notification with the cracked password</li>
</ol>

<p>The entire system is serverless except for the short-lived EC2 instances that actually do the cracking. This keeps costs low, I only pay for compute when I’m actively cracking a handshake.</p>

<h2 id="code-deep-dive">Code Deep Dive</h2>

<p>Let me show you some of the key pieces. All the sensitive info has been replaced with placeholders, but you’ll get the idea.</p>

<h3 id="terraform-s3-bucket-structure">Terraform: S3 Bucket Structure</h3>

<p>The foundation of this system is a well-organized S3 bucket. I used Terraform to define the entire infrastructure as code:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Main S3 bucket for storing pcap files</span>
<span class="nx">resource</span> <span class="s2">"aws_s3_bucket"</span> <span class="s2">"pcap_bucket"</span> <span class="p">{</span>
  <span class="nx">bucket</span> <span class="p">=</span> <span class="nx">local</span><span class="err">.</span><span class="nx">bucket_name</span>

  <span class="nx">tags</span> <span class="p">=</span> <span class="nx">merge</span><span class="err">(</span><span class="nx">var</span><span class="err">.</span><span class="nx">tags</span><span class="err">,</span> <span class="p">{</span>
    <span class="nx">Name</span> <span class="p">=</span> <span class="s2">"${local.name_prefix}-bucket"</span>
  <span class="p">}</span><span class="err">)</span>
<span class="p">}</span>

<span class="c1"># Enable server-side encryption</span>
<span class="nx">resource</span> <span class="s2">"aws_s3_bucket_server_side_encryption_configuration"</span> <span class="s2">"pcap_bucket"</span> <span class="p">{</span>
  <span class="nx">bucket</span> <span class="p">=</span> <span class="nx">aws_s3_bucket</span><span class="err">.</span><span class="nx">pcap_bucket</span><span class="err">.</span><span class="nx">id</span>

  <span class="nx">rule</span> <span class="p">{</span>
    <span class="nx">apply_server_side_encryption_by_default</span> <span class="p">{</span>
      <span class="nx">sse_algorithm</span> <span class="p">=</span> <span class="s2">"AES256"</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span>

<span class="c1"># Block public access (security best practice)</span>
<span class="nx">resource</span> <span class="s2">"aws_s3_bucket_public_access_block"</span> <span class="s2">"pcap_bucket"</span> <span class="p">{</span>
  <span class="nx">bucket</span> <span class="p">=</span> <span class="nx">aws_s3_bucket</span><span class="err">.</span><span class="nx">pcap_bucket</span><span class="err">.</span><span class="nx">id</span>

  <span class="nx">block_public_acls</span>       <span class="p">=</span> <span class="kc">true</span>
  <span class="nx">block_public_policy</span>     <span class="p">=</span> <span class="kc">true</span>
  <span class="nx">ignore_public_acls</span>      <span class="p">=</span> <span class="kc">true</span>
  <span class="nx">restrict_public_buckets</span> <span class="p">=</span> <span class="kc">true</span>
<span class="p">}</span>

<span class="c1"># Lifecycle policy to auto-delete old files</span>
<span class="nx">resource</span> <span class="s2">"aws_s3_bucket_lifecycle_configuration"</span> <span class="s2">"pcap_bucket"</span> <span class="p">{</span>
  <span class="nx">bucket</span> <span class="p">=</span> <span class="nx">aws_s3_bucket</span><span class="err">.</span><span class="nx">pcap_bucket</span><span class="err">.</span><span class="nx">id</span>

  <span class="c1"># Auto-delete completed pcaps after 30 days</span>
  <span class="nx">rule</span> <span class="p">{</span>
    <span class="nx">id</span>     <span class="p">=</span> <span class="s2">"delete-completed"</span>
    <span class="nx">status</span> <span class="p">=</span> <span class="s2">"Enabled"</span>

    <span class="nx">filter</span> <span class="p">{</span>
      <span class="nx">prefix</span> <span class="p">=</span> <span class="s2">"completed/"</span>
    <span class="p">}</span>

    <span class="nx">expiration</span> <span class="p">{</span>
      <span class="nx">days</span> <span class="p">=</span> <span class="nx">var</span><span class="err">.</span><span class="nx">s3_lifecycle_expiration_days</span>
    <span class="p">}</span>
  <span class="p">}</span>

  <span class="c1"># Auto-delete rejected pcaps after 30 days</span>
  <span class="nx">rule</span> <span class="p">{</span>
    <span class="nx">id</span>     <span class="p">=</span> <span class="s2">"delete-rejected"</span>
    <span class="nx">status</span> <span class="p">=</span> <span class="s2">"Enabled"</span>

    <span class="nx">filter</span> <span class="p">{</span>
      <span class="nx">prefix</span> <span class="p">=</span> <span class="s2">"rejected/"</span>
    <span class="p">}</span>

    <span class="nx">expiration</span> <span class="p">{</span>
      <span class="nx">days</span> <span class="p">=</span> <span class="nx">var</span><span class="err">.</span><span class="nx">s3_lifecycle_expiration_days</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The folder structure is simple but effective:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">staging/</code> – New uploads go here</li>
  <li><code class="language-plaintext highlighter-rouge">approved/</code> – Files I’ve approved for cracking</li>
  <li><code class="language-plaintext highlighter-rouge">processing/</code> – Currently being cracked</li>
  <li><code class="language-plaintext highlighter-rouge">completed/</code> – Finished jobs</li>
  <li><code class="language-plaintext highlighter-rouge">rejected/</code> – Networks I chose not to crack</li>
  <li><code class="language-plaintext highlighter-rouge">results/</code> – Cracked passwords and job metadata</li>
</ul>

<p>The lifecycle policies ensure I don’t pay for storage forever. Old files automatically delete after 30 days.</p>

<h3 id="lambda-upload-handler">Lambda: Upload Handler</h3>

<p>When a new PCAP file hits the <code class="language-plaintext highlighter-rouge">staging/</code> folder, this Lambda function begins to work:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""
Pwnagotchi PCAP Cracker - Upload Handler Lambda
Triggered when a new pcap is uploaded to S3 staging/ folder
Sends Telegram notification with approval/reject options
"""</span>

<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">boto3</span>
<span class="kn">import</span> <span class="nn">urllib3</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">import</span> <span class="nn">uuid</span>

<span class="c1"># Initialize AWS clients
</span><span class="n">s3</span> <span class="o">=</span> <span class="n">boto3</span><span class="p">.</span><span class="n">client</span><span class="p">(</span><span class="s">'s3'</span><span class="p">)</span>
<span class="n">http</span> <span class="o">=</span> <span class="n">urllib3</span><span class="p">.</span><span class="n">PoolManager</span><span class="p">()</span>

<span class="c1"># Environment variables (set in Lambda configuration)
</span><span class="n">TELEGRAM_BOT_TOKEN</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'TELEGRAM_BOT_TOKEN'</span><span class="p">]</span>  <span class="c1"># &lt;YOUR_BOT_TOKEN&gt;
</span><span class="n">TELEGRAM_CHAT_ID</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'TELEGRAM_CHAT_ID'</span><span class="p">]</span>      <span class="c1"># &lt;YOUR_CHAT_ID&gt;
</span><span class="n">S3_BUCKET</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'S3_BUCKET'</span><span class="p">]</span>                    <span class="c1"># &lt;YOUR_BUCKET_NAME&gt;
</span>

<span class="k">def</span> <span class="nf">extract_pcap_metadata</span><span class="p">(</span><span class="n">bucket</span><span class="p">,</span> <span class="n">key</span><span class="p">):</span>
    <span class="s">"""
    Extract SSID and BSSID from pcap file
    For MVP: parse filename (format: SSID_BSSID.pcap)
    """</span>
    <span class="n">filename</span> <span class="o">=</span> <span class="n">key</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'/'</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    
    <span class="k">if</span> <span class="s">'_'</span> <span class="ow">in</span> <span class="n">filename</span> <span class="ow">and</span> <span class="n">filename</span><span class="p">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'.pcap'</span><span class="p">):</span>
        <span class="n">parts</span> <span class="o">=</span> <span class="n">filename</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'.pcap'</span><span class="p">,</span> <span class="s">''</span><span class="p">).</span><span class="n">split</span><span class="p">(</span><span class="s">'_'</span><span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">parts</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">2</span><span class="p">:</span>
            <span class="n">ssid</span> <span class="o">=</span> <span class="n">parts</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
            <span class="n">bssid</span> <span class="o">=</span> <span class="s">'_'</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">parts</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
            <span class="k">return</span> <span class="n">ssid</span><span class="p">,</span> <span class="n">bssid</span>
    
    <span class="c1"># Fallback: use filename as SSID
</span>    <span class="n">ssid</span> <span class="o">=</span> <span class="n">filename</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'.pcap'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span>
    <span class="n">bssid</span> <span class="o">=</span> <span class="s">'unknown'</span>
    <span class="k">return</span> <span class="n">ssid</span><span class="p">,</span> <span class="n">bssid</span>


<span class="k">def</span> <span class="nf">send_telegram_message</span><span class="p">(</span><span class="n">message</span><span class="p">):</span>
    <span class="s">"""Send message to Telegram using Bot API"""</span>
    <span class="n">url</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"https://api.telegram.org/bot</span><span class="si">{</span><span class="n">TELEGRAM_BOT_TOKEN</span><span class="si">}</span><span class="s">/sendMessage"</span>
    
    <span class="n">payload</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">'chat_id'</span><span class="p">:</span> <span class="n">TELEGRAM_CHAT_ID</span><span class="p">,</span>
        <span class="s">'text'</span><span class="p">:</span> <span class="n">message</span><span class="p">,</span>
        <span class="s">'parse_mode'</span><span class="p">:</span> <span class="s">'Markdown'</span>
    <span class="p">}</span>
    
    <span class="n">response</span> <span class="o">=</span> <span class="n">http</span><span class="p">.</span><span class="n">request</span><span class="p">(</span>
        <span class="s">'POST'</span><span class="p">,</span>
        <span class="n">url</span><span class="p">,</span>
        <span class="n">body</span><span class="o">=</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">payload</span><span class="p">).</span><span class="n">encode</span><span class="p">(</span><span class="s">'utf-8'</span><span class="p">),</span>
        <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s">'Content-Type'</span><span class="p">:</span> <span class="s">'application/json'</span><span class="p">}</span>
    <span class="p">)</span>
    
    <span class="k">if</span> <span class="n">response</span><span class="p">.</span><span class="n">status</span> <span class="o">!=</span> <span class="mi">200</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nb">Exception</span><span class="p">(</span><span class="sa">f</span><span class="s">"Failed to send Telegram message: </span><span class="si">{</span><span class="n">response</span><span class="p">.</span><span class="n">status</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">'utf-8'</span><span class="p">))</span>


<span class="k">def</span> <span class="nf">lambda_handler</span><span class="p">(</span><span class="n">event</span><span class="p">,</span> <span class="n">context</span><span class="p">):</span>
    <span class="s">"""Main Lambda handler"""</span>
    <span class="k">for</span> <span class="n">record</span> <span class="ow">in</span> <span class="n">event</span><span class="p">[</span><span class="s">'Records'</span><span class="p">]:</span>
        <span class="n">bucket</span> <span class="o">=</span> <span class="n">record</span><span class="p">[</span><span class="s">'s3'</span><span class="p">][</span><span class="s">'bucket'</span><span class="p">][</span><span class="s">'name'</span><span class="p">]</span>
        <span class="n">key</span> <span class="o">=</span> <span class="n">record</span><span class="p">[</span><span class="s">'s3'</span><span class="p">][</span><span class="s">'object'</span><span class="p">][</span><span class="s">'key'</span><span class="p">]</span>
        
        <span class="c1"># Only process .pcap files in staging/
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="n">key</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'staging/'</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">key</span><span class="p">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'.pcap'</span><span class="p">):</span>
            <span class="k">continue</span>
        
        <span class="c1"># Generate unique job ID
</span>        <span class="n">job_id</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">uuid</span><span class="p">.</span><span class="n">uuid4</span><span class="p">())[:</span><span class="mi">8</span><span class="p">]</span>
        
        <span class="c1"># Extract metadata
</span>        <span class="n">ssid</span><span class="p">,</span> <span class="n">bssid</span> <span class="o">=</span> <span class="n">extract_pcap_metadata</span><span class="p">(</span><span class="n">bucket</span><span class="p">,</span> <span class="n">key</span><span class="p">)</span>
        
        <span class="c1"># Format Telegram message
</span>        <span class="n">message</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"""
🆕 *New Network Captured*

📡 *SSID:* `</span><span class="si">{</span><span class="n">ssid</span><span class="si">}</span><span class="s">`
🔗 *BSSID:* `</span><span class="si">{</span><span class="n">bssid</span><span class="si">}</span><span class="s">`
📦 *Job ID:* `</span><span class="si">{</span><span class="n">job_id</span><span class="si">}</span><span class="s">`
🕐 *Time:* </span><span class="si">{</span><span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">().</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%Y-%m-%d %H</span><span class="si">:</span><span class="o">%</span><span class="n">M</span><span class="si">:</span><span class="o">%</span><span class="n">S</span><span class="s">')</span><span class="si">}</span><span class="s"> UTC

*Reply with:*
✅ `/approve </span><span class="si">{</span><span class="n">job_id</span><span class="si">}</span><span class="s">` - Start cracking
❌ `/reject </span><span class="si">{</span><span class="n">job_id</span><span class="si">}</span><span class="s">` - Ignore this network
"""</span>
        
        <span class="c1"># Send notification
</span>        <span class="n">send_telegram_message</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Telegram notification sent for job </span><span class="si">{</span><span class="n">job_id</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="p">{</span><span class="s">'statusCode'</span><span class="p">:</span> <span class="mi">200</span><span class="p">,</span> <span class="s">'body'</span><span class="p">:</span> <span class="s">'Notifications sent'</span><span class="p">}</span>
</code></pre></div></div>

<p>The beauty of this setup is that I get an instant notification every time my pwnagotchi catches a new network and updates while it goes through the steps. With a lot of companies sharing spaces or having a few floors of a building there is a huge chance that you will get multiple handshakes from neighboring networks, which you are not authorized to attack. By having the ability to approve or reject a network I can ensure that I am only cracking handshakes that I am authorized to crack.</p>

<p><img src="/assets/img/posts/20260128/telegram-screenshot.png" alt="Screenshot of Telegram notification" /></p>

<p>Here’s what the notification looks like when a new network is captured:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>🆕 New Network Captured

📡 SSID: CoffeeShop_Guest
🔗 BSSID: a4:12:34:56:78:9a
📦 Job ID: f3a8b2c1
🕐 Time: 2026-01-28 14:32:15 UTC
📁 File: CoffeeShop_Guest_a4_12_34_56_78_9a.pcap

Reply with:
✅ /approve f3a8b2c1 - Start cracking
❌ /reject f3a8b2c1 - Ignore this network
</code></pre></div></div>

<p>I tap <code class="language-plaintext highlighter-rouge">/approve f3a8b2c1</code> and get an immediate confirmation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>✅ Approved!

📦 Job ID: f3a8b2c1
📡 SSID: CoffeeShop_Guest
📁 Status: Moved to approved/

⚡ The cracking job will start automatically.
You'll receive a notification when complete.
</code></pre></div></div>

<p>A minute later, the EC2 instance spins up and I get a status update:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>⚙️ Cracking Started

📦 Job ID: f3a8b2c1
📡 SSID: CoffeeShop_Guest
🖥️ Instance: i-0a1b2c3d4e5f6g7h8
⏱️ Status: Processing...

Running hashcat on g4dn.xlarge
Estimated time: 5-15 minutes
</code></pre></div></div>

<p>And finally, when the password is cracked:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>🎉 Password Cracked!

📦 Job ID: f3a8b2c1
📡 SSID: CoffeeShop_Guest
🔓 Password: coffee2026

⏱️ Time taken: 8 minutes 32 seconds
💰 Cost: $0.42
📁 Full results saved to S3
</code></pre></div></div>

<p>Or, if it fails:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❌ Cracking Failed

📦 Job ID: f3a8b2c1
📡 SSID: CoffeeShop_Guest

Password not found in rockyou.txt wordlist.
⏱️ Time taken: 12 minutes
💰 Cost: $0.62

Try a different wordlist or move on to the next target.
</code></pre></div></div>

<p>This real-time feedback is very helpful because it lets me know exactly what is going on with the pipeline, and if there are any issues I can issue a <code class="language-plaintext highlighter-rouge">/kill [job-id]</code> command to terminate the EC2 instance and investigate what went wrong, and the pcap file is still in S3 so I can try again later if I want to.</p>

<h3 id="lambda-telegram-webhook">Lambda: Telegram Webhook</h3>

<p>This Lambda function acts as a webhook endpoint for the Telegram bot. When I send a command like <code class="language-plaintext highlighter-rouge">/approve abc123</code>, it processes the request and moves files around in S3:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">handle_approve</span><span class="p">(</span><span class="n">job_id</span><span class="p">,</span> <span class="n">chat_id</span><span class="p">):</span>
    <span class="s">"""Handle /approve command"""</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Processing /approve for job </span><span class="si">{</span><span class="n">job_id</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="c1"># Find the job files
</span>    <span class="n">pcap_key</span><span class="p">,</span> <span class="n">json_key</span> <span class="o">=</span> <span class="n">find_job_by_id</span><span class="p">(</span><span class="n">job_id</span><span class="p">)</span>
    
    <span class="k">if</span> <span class="ow">not</span> <span class="n">pcap_key</span><span class="p">:</span>
        <span class="n">send_telegram_message</span><span class="p">(</span>
            <span class="sa">f</span><span class="s">"❌ Job `</span><span class="si">{</span><span class="n">job_id</span><span class="si">}</span><span class="s">` not found. It may have already been processed."</span><span class="p">,</span>
            <span class="n">chat_id</span>
        <span class="p">)</span>
        <span class="k">return</span>
    
    <span class="k">try</span><span class="p">:</span>
        <span class="c1"># Move files from staging/ to approved/
</span>        <span class="n">dest_pcap</span><span class="p">,</span> <span class="n">dest_json</span> <span class="o">=</span> <span class="n">move_files</span><span class="p">(</span><span class="s">'staging'</span><span class="p">,</span> <span class="s">'approved'</span><span class="p">,</span> <span class="n">pcap_key</span><span class="p">,</span> <span class="n">json_key</span><span class="p">)</span>
        
        <span class="c1"># Update metadata status
</span>        <span class="n">metadata</span> <span class="o">=</span> <span class="n">update_metadata_status</span><span class="p">(</span><span class="n">dest_json</span><span class="p">,</span> <span class="s">'approved'</span><span class="p">)</span>
        
        <span class="c1"># Send confirmation
</span>        <span class="n">ssid</span> <span class="o">=</span> <span class="n">metadata</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'ssid'</span><span class="p">,</span> <span class="s">'Unknown'</span><span class="p">)</span>
        <span class="n">message</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"""
✅ *Approved!*

📦 *Job ID:* `</span><span class="si">{</span><span class="n">job_id</span><span class="si">}</span><span class="s">`
📡 *SSID:* `</span><span class="si">{</span><span class="n">ssid</span><span class="si">}</span><span class="s">`
📁 *Status:* Moved to approved/

⚡ The cracking job will start automatically.
You'll receive a notification when complete.
"""</span>
        <span class="n">send_telegram_message</span><span class="p">(</span><span class="n">message</span><span class="p">,</span> <span class="n">chat_id</span><span class="p">)</span>
        
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="n">send_telegram_message</span><span class="p">(</span><span class="sa">f</span><span class="s">"❌ Error: </span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">chat_id</span><span class="p">)</span>
</code></pre></div></div>

<p>I also added <code class="language-plaintext highlighter-rouge">/status</code> and <code class="language-plaintext highlighter-rouge">/kill</code> commands. <code class="language-plaintext highlighter-rouge">/status [job-id]</code> shows where a job is in the pipeline, and <code class="language-plaintext highlighter-rouge">/kill [job-id]</code> terminates a running EC2 instance if I change my mind mid-crack or if it is taking too long.</p>

<h3 id="pwnagotchi-configuration">Pwnagotchi Configuration</h3>

<p>On the pwnagotchi side, the config is straightforward. Here’s a sanitized version of my <code class="language-plaintext highlighter-rouge">config.toml</code>:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Basic device configuration</span>
<span class="py">main.name</span> <span class="p">=</span> <span class="s">"[yourpwnagotchi]"</span>
<span class="py">main.whitelist</span> <span class="p">=</span> <span class="p">[</span>
    <span class="s">"[YourHomeNetwork]"</span><span class="p">,</span>
    <span class="s">"[YourNeighborsNetwork]"</span>
<span class="p">]</span>

<span class="c"># PiSugar battery support</span>
<span class="py">main.plugins.pisugar2.enabled</span> <span class="p">=</span> <span class="kc">true</span>

<span class="c"># Bluetooth tethering to phone</span>
<span class="py">main.plugins.bt-tether.enabled</span> <span class="p">=</span> <span class="kc">true</span>
<span class="py">main.plugins.bt-tether.share_internet</span> <span class="p">=</span> <span class="kc">true</span>
<span class="py">main.plugins.bt-tether.phone-name</span> <span class="p">=</span> <span class="s">"[YourPhone]"</span>
<span class="py">main.plugins.bt-tether.phone</span> <span class="p">=</span> <span class="s">"android"</span>
<span class="py">main.plugins.bt-tether.ip</span> <span class="p">=</span> <span class="s">"[YOUR:Phone:IP:Address]"</span>
<span class="py">main.plugins.bt-tether.mac</span> <span class="p">=</span> <span class="s">"[YOUR:MAC:ADDRESS]"</span>

<span class="c"># Waveshare display configuration</span>
<span class="py">ui.display.enabled</span> <span class="p">=</span> <span class="kc">true</span>
<span class="py">ui.display.type</span> <span class="p">=</span> <span class="s">"waveshare_4"</span>
<span class="py">ui.invert</span> <span class="p">=</span> <span class="kc">true</span>

<span class="c"># Web interface access (if you want to use the web interface)</span>
<span class="py">ui.web.address</span> <span class="p">=</span> <span class="s">"0.0.0.0"</span>
<span class="py">ui.web.username</span> <span class="p">=</span> <span class="s">"[admin]"</span>
<span class="py">ui.web.password</span> <span class="p">=</span> <span class="s">"[your_secure_password]"</span>
</code></pre></div></div>

<p>The S3 upload plugin I wrote isn’t included here, but it’s a simple Python script that watches the handshakes directory and uses boto3 to upload new PCAP files whenever they’re created.</p>

<h2 id="the-cracking-pipeline">The Cracking Pipeline</h2>

<p>I didn’t include the actual cracking script code here (it’s basically a bash wrapper around hashcat), but here’s what happens:</p>

<ol>
  <li><strong>EC2 Spot Instance Launches</strong> – I use g4dn.xlarge instances with NVIDIA GPUs</li>
  <li><strong>User Data Script Runs</strong> – Installs hashcat, aircrack-ng, and downloads the rockyou wordlist</li>
  <li><strong>PCAP Conversion</strong> – Converts the PCAP to a hashcat-compatible format</li>
  <li><strong>GPU-Accelerated Cracking</strong> – Runs hashcat with optimized settings</li>
  <li><strong>Results Upload</strong> – If a password is found, it’s uploaded to S3 <code class="language-plaintext highlighter-rouge">results/</code> folder</li>
  <li><strong>Telegram Notification</strong> – I get a message with the cracked password (or a failure notice)</li>
  <li><strong>Instance Terminates</strong> – No manual cleanup needed</li>
</ol>

<p>The whole thing is cost-effective because spot instances are cheap and I’m only running them for minutes at a time. A typical cracking job costs less than dollar.</p>

<h2 id="community-and-resources">Community and Resources</h2>

<p>If you’re interested in building your own pwnagotchi, here are some great resources:</p>

<ul>
  <li><strong>Official Site</strong>: <a href="https://pwnagotchi.ai/">pwnagotchi.ai</a> – Start here for the basics</li>
  <li><strong>jayofelony’s Fork</strong>: <a href="https://github.com/jayofelony/pwnagotchi">github.com/jayofelony/pwnagotchi</a></li>
  <li><strong>Discord</strong>: The pwnagotchi community is active and helpful (Pwnagotchi Unofficial)</li>
  <li><strong>Reddit</strong>: Several subreddits dedicated to WiFi security and pwnagotchi builds</li>
</ul>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>This project taught me a lot about integrating hardware with serverless architecture. The pwnagotchi itself is a fun hardware hack, but connecting it to AWS Lambda, S3, and EC2 made it something that you can use regularly.</p>

<p>The key insight for me was this: <strong>you don’t need everything running 24/7</strong>. The pwnagotchi runs on battery. The S3 bucket just sits there. Lambda functions only charge for executions. EC2 instances spin up when needed and die when done.</p>

<p>It’s a model that works for a lot of side projects. Build the always-on parts cheap (or free), and use on-demand compute for the heavy lifting.</p>

<p>If you build something similar, I’d love to hear about it. And if you’re part of the pwnagotchi community, feel free to reach out, I’m always looking to learn new tricks.</p>

<p>Happy hacking! 🤖🔓</p>]]></content><author><name>Vatché</name></author><category term="AI" /><category term="Security" /><category term="AWS" /><category term="Hardware" /><category term="Raspberry Pi" /><category term="Pwnagotchi" /><category term="AWS Lambda" /><category term="Security Research" /><category term="Raspberry Pi" /><category term="Terraform" /><category term="Serverless" /><summary type="html"><![CDATA[A complete guide to building a cloud-connected Pwnagotchi that automatically uploads WiFi handshakes to S3, sends Telegram notifications for approval, and spins up EC2 spot instances for GPU-accelerated password cracking.]]></summary></entry><entry><title type="html">Design Thinking and Software Development: Why Product Management Skills Matter More Than Ever</title><link href="https://vatchechamlian.com/design-thinking-and-development.html" rel="alternate" type="text/html" title="Design Thinking and Software Development: Why Product Management Skills Matter More Than Ever" /><published>2026-01-21T00:00:00+00:00</published><updated>2026-01-21T00:00:00+00:00</updated><id>https://vatchechamlian.com/design-thinking-and-development</id><content type="html" xml:base="https://vatchechamlian.com/design-thinking-and-development.html"><![CDATA[<h1 id="why-product-management-matters">Why Product Management Matters</h1>

<p>I remember sitting in a conference room years ago, surrounded by product managers, developers, and designers. We were three sprints into a project and nobody could agree on what “done” actually meant. The requirements document was 47 pages long. The developers were building features that nobody asked for. The designers were frustrated. And the client? They just wanted something that solved their problem.</p>

<p>That project taught me something that I carry with me to this day: the quality of your input determines the quality of your output. Back then, we were talking about requirements documents and user stories. Today, we might be talking about prompts and AI-generated code. But the fundamental truth remains the same whether you are writing every line by hand or letting AI handle the typing.</p>

<p>Here is the thing: tools change, but the hard problem stays the same. You can mass produce code faster than ever, but that does not help if you are building the wrong thing. You cannot interview your users with a keyboard shortcut. You cannot synthesize conflicting feedback into a coherent product vision with a CLI command. You cannot decide which problem is worth solving first by asking an LLM.</p>

<p>That is product management. And in a world where writing code is getting easier every year, the ability to define what gets built clearly, precisely, testably, is the skill that matters most.</p>

<h2 id="the-problem-is-not-new-but-it-is-getting-louder">The Problem Is Not New, But It Is Getting Louder</h2>

<p>Y Combinator reported that 25% of startups in their Winter 2025 batch had codebases that were 95% AI-generated. That statistic makes a lot of people nervous about the future of software development. But here is what it actually reveals: the bottleneck was never typing speed.</p>

<p>The developers getting the best results, whether they are using AI tools or not, are the ones who actually understand what they are trying to build before they start.</p>

<p>In May 2025, a Swedish vibe coding app called Lovable was found to have security vulnerabilities in 170 out of 1,645 web applications it helped create. The code looked fine, it ran fine, but it was fundamentally broken because nobody stopped to ask the right questions first.</p>

<p>That is not an AI problem. That is a requirements problem. And requirements problems have been shipping broken software since long before anyone heard of large language models.</p>

<h2 id="enter-design-thinking-yes-the-sticky-note-people">Enter Design Thinking (Yes, the Sticky Note People)</h2>

<p>I know what you are thinking. “Design thinking? That is for UX designers and people who like whiteboards.”</p>

<p>The LUMA Institute has been teaching human-centered design for years. Their framework breaks down into three core skills: Looking, Understanding, and Making (A for aligning or adapting depending on who you ask). It is not complicated, but it is powerful, and it is exactly what most development processes are missing.</p>

<p><strong>Looking</strong> is about observing human experience. Who are your users? What are they actually trying to accomplish? What frustrates them?</p>

<p><strong>Understanding</strong> is about analyzing challenges and opportunities. What patterns emerge? What assumptions are you making?</p>

<p><strong>Making</strong> is about envisioning future possibilities. What does success look like? How will you know when you get there?</p>

<p>Sound familiar? It should. These are the exact questions you need to answer before you write a single line of code, regardless of how that code gets written.</p>

<h2 id="from-observations-to-user-stories-the-methods-that-actually-work">From Observations to User Stories: The Methods That Actually Work</h2>

<p>Here is where design thinking gets practical. LUMA does not just give you philosophy. It gives you a framework, specific methods for moving from “I have a bunch of observations” to “I know exactly what to build.” Here are the modules I use most often, with some examples.</p>

<h3 id="rose-thorn-bud-sorting-the-signal-from-the-noise">Rose, Thorn, Bud: Sorting the Signal from the Noise</h3>

<p>When you are in the Looking phase, you collect a lot of information. User interviews, support tickets, analytics data, competitor analysis. It is overwhelming. Rose, Thorn, Bud helps you make sense of it.</p>

<p>The method is simple. Take all your observations and sort them into three categories:</p>

<p><strong>Roses</strong> are things that are working well. These are the features users love, the workflows that feel smooth, the moments of delight. Do not skip this category. Understanding what works is just as important as understanding what does not.</p>

<p><strong>Thorns</strong> are pain points. These are the frustrations, the workarounds, the things that make users curse under their breath. Every thorn is a potential user story waiting to happen.</p>

<p><strong>Buds</strong> are opportunities. These are the “what if” moments. The features users wish existed. The adjacent problems you could solve. Buds often become your most innovative features.</p>

<p>Here is what this looks like in practice. Say you are building an expense tracking app for small business owners. After interviewing ten users, you might end up with:</p>

<p><strong>Roses:</strong></p>
<ul>
  <li>Users love being able to photograph receipts on their phone</li>
  <li>The monthly summary report saves them hours at tax time</li>
  <li>Integration with their bank account means less manual entry</li>
</ul>

<p><strong>Thorns:</strong></p>
<ul>
  <li>Categorizing expenses is tedious and error-prone</li>
  <li>No way to split a single receipt across multiple expense categories</li>
  <li>Forgot to log expenses until a week later, then could not remember the details</li>
</ul>

<p><strong>Buds:</strong></p>
<ul>
  <li>Several users mentioned wishing they could see spending trends over time</li>
  <li>One user manually tracks which expenses are tax-deductible versus not</li>
  <li>A few users share expense responsibilities with a business partner</li>
</ul>

<p>Now you have raw material for user stories. But which ones should you build first?</p>

<h3 id="importancedifficulty-matrix-picking-your-battles">Importance/Difficulty Matrix: Picking Your Battles</h3>

<p>Not all user stories are created equal. Some will take months to build and barely move the needle. Others can be shipped in a week and transform the user experience. The Importance/Difficulty Matrix helps you see the difference.</p>

<p>Draw a 2x2 grid. The vertical axis is Difficulty (how hard is this to implement?). The horizontal axis is Importance (how much does this matter to users?). Start with the importance and do your best to ensure that each item is in its own spot, no overlapping (if possible). Once you have the importance you can start to move each item up towards the proper difficulty level. This exercise helps you identify easy wins, important considerations for difficult initiatives, etc.</p>

<p>Plot each potential feature or user story on the grid.</p>

<p><strong>High Importance, Low Difficulty (bottom right)</strong>: These are your quick wins. Do these first. They deliver value fast and build momentum.</p>

<p><strong>High Importance, High Difficulty (top right)</strong>: These are your major projects. Important but need significant investment. Plan carefully.</p>

<p><strong>Low Importance, Low Difficulty (bottom left)</strong>: Fill-in work. Nice to have but not urgent. Good for junior developers or slow weeks.</p>

<p><strong>Low Importance, High Difficulty (top left)</strong>: Time sinks. Avoid these. They burn resources without delivering proportional value.</p>

<p>Taking our expense app thorns and buds:</p>

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>Importance</th>
      <th>Difficulty</th>
      <th>Quadrant</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Auto-categorize expenses using AI</td>
      <td>High</td>
      <td>Medium</td>
      <td>Quick Win</td>
    </tr>
    <tr>
      <td>Split receipts across categories</td>
      <td>High</td>
      <td>Low</td>
      <td>Quick Win</td>
    </tr>
    <tr>
      <td>Spending trend visualizations</td>
      <td>Medium</td>
      <td>Medium</td>
      <td>Major Project</td>
    </tr>
    <tr>
      <td>Tax-deductible expense flagging</td>
      <td>High</td>
      <td>Low</td>
      <td>Quick Win</td>
    </tr>
    <tr>
      <td>Multi-user expense sharing</td>
      <td>Medium</td>
      <td>High</td>
      <td>Time Sink?</td>
    </tr>
    <tr>
      <td>Smart reminders to log expenses</td>
      <td>High</td>
      <td>Low</td>
      <td>Quick Win</td>
    </tr>
  </tbody>
</table>

<p><img src="./assets/img/posts/20260121/design-thinking-example-importance-difficulty-matrix.png" alt="Importance/Difficulty Matrix" /></p>

<p>Look at that. Four quick wins jumped out immediately. Those become your first sprint. The multi-user sharing might seem cool, but the effort-to-value ratio is terrible right now. Maybe later, once you have nailed the core experience.</p>

<h3 id="affinity-clustering-finding-the-patterns-you-missed">Affinity Clustering: Finding the Patterns You Missed</h3>

<p>Sometimes your observations do not fit neatly into categories. You have sticky notes everywhere and no clear picture. Affinity Clustering helps you find the hidden structure.</p>

<p>The process works like this:</p>

<ol>
  <li>Write each observation, quote, or insight on its own sticky note (or digital equivalent)</li>
  <li>Start grouping notes that seem related. Do not think too hard. Go with your gut.</li>
  <li>As clusters form, give them names. What theme connects these observations?</li>
  <li>Look at the clusters. Which ones are biggest? Which ones surprise you?</li>
</ol>

<p>I did this recently with support tickets for a SaaS product. The obvious clusters were what you would expect: billing issues, feature requests, bug reports. But then an unexpected cluster emerged: users trying to do things the product was never designed for. They were using a project management tool to run their entire small business.</p>

<p>That cluster became a product pivot discussion. We never would have seen it if we had just triaged tickets the normal way. This does not mean you always need to pivot, in the end we realized that a proper integration would be more beneficial, but that integration became a priority.</p>

<p>For our expense app, Affinity Clustering might reveal that three seemingly unrelated thorns are actually the same underlying problem:</p>

<ul>
  <li>“Categorizing expenses is tedious”</li>
  <li>“Forgot to log expenses until a week later”</li>
  <li>“Can’t remember what this $47.32 charge was for”</li>
</ul>

<p>Take the time to really dig into the root cause and why it matters when you try to title the cluster. Users lose the context needed to accurately log expenses as time passes.
Cluster name: <strong>“Context Decay: delays in reporting expenses lead to poor reporting, gaps in actual finances, and frustration from those filing/processing the expenses”</strong> vs <strong>“Context Decay”</strong>.</p>

<p>Now instead of three separate features, you have one user story that addresses the root cause:</p>

<blockquote>
  <p>As a small business owner, I want to log expenses immediately after purchase with minimal friction so that I capture accurate details before I forget them.</p>
</blockquote>

<p>That is a much better foundation than “make categorizing easier,” whether you are writing the code yourself or handing it off to an AI tool.</p>

<h2 id="the-user-story-clarity-that-scales">The User Story: Clarity That Scales</h2>

<p>User stories have been around since the early days of Agile. The format is simple:</p>

<blockquote>
  <p>As a [type of user], I want [some goal] so that [some reason].</p>
</blockquote>

<p>For example: “As a busy parent, I want to save my grocery list so that I do not have to remember everything at the store.”</p>

<p>But the magic is not in the format. It is in what the format forces you to do. You have to think about who the user is. You have to articulate what they actually want. And most importantly, you have to explain why it matters.</p>

<p>This is where the INVEST criteria comes in. Good user stories should be:</p>

<ul>
  <li><strong>Independent</strong>: Not tangled up with other stories</li>
  <li><strong>Negotiable</strong>: Open to conversation, not set in stone</li>
  <li><strong>Valuable</strong>: Actually useful to someone</li>
  <li><strong>Estimable</strong>: Clear enough to estimate the work</li>
  <li><strong>Small</strong>: Completable in a reasonable timeframe</li>
  <li><strong>Testable</strong>: You can verify when it is done</li>
</ul>

<p>These criteria matter whether you are handing the story to a junior developer, a senior architect, or an AI coding assistant. Vague input produces vague output, while clear input produces clear output. The tool does not change the equation.</p>

<h2 id="acceptance-criteria-the-bridge-between-what-and-done">Acceptance Criteria: The Bridge Between “What” and “Done”</h2>

<p>This is where things get really interesting. In Agile, every user story comes with acceptance criteria. These are the specific conditions that must be met for the story to be considered complete. The most common format uses Given/When/Then:</p>

<blockquote>
  <p><strong>Given</strong> [some context]
<strong>When</strong> [some action is taken]
<strong>Then</strong> [some outcome is expected]</p>
</blockquote>

<p>For example:</p>
<ul>
  <li>Given a user is logged in</li>
  <li>When they click “Save List”</li>
  <li>Then their grocery items should persist across sessions</li>
</ul>

<p>This is not just documentation, this is a testing specification. And whether you are writing tests by hand, using TDD, or letting AI generate your test suite, acceptance criteria tell you what “correct” actually means.</p>

<h2 id="putting-it-all-together-from-sticky-notes-to-shipping-code">Putting It All Together: From Sticky Notes to Shipping Code</h2>

<p>Let me walk you through the full workflow using our expense app example.</p>

<p><strong>Step 1: Looking (Rose, Thorn, Bud)</strong></p>

<p>You interview users and observe their behavior. You collect observations and sort them:</p>

<ul>
  <li><strong>Rose</strong>: “I love that I can just snap a photo of the receipt”</li>
  <li><strong>Thorn</strong>: “I always forget to log my expenses until it’s too late”</li>
  <li><strong>Thorn</strong>: “The categories never match how I actually think about spending”</li>
  <li><strong>Bud</strong>: “Wish it could just tell me what category something should be”</li>
</ul>

<p><strong>Step 2: Understanding (Affinity Clustering)</strong></p>

<p>You notice the thorns cluster around a theme: users lose context over time. The bud suggests users want intelligence, not just data entry.</p>

<p>Cluster: <strong>Context Decay and Cognitive Load: delays in reporting expenses lead to poor reporting, gaps in actual finances, and frustration from those filing/processing the expenses</strong></p>

<p>The real problem is not the interface. It is that expense tracking requires too much mental effort at the wrong time.</p>

<p><strong>Step 3: Prioritizing (Importance/Difficulty Matrix)</strong></p>

<p>You brainstorm potential solutions and plot them:</p>

<ul>
  <li>AI-powered auto-categorization: High importance, Medium difficulty → Build it</li>
  <li>Push notification reminders: High importance, Low difficulty → Build it now</li>
  <li>Voice memo for expense context: Medium importance, Low difficulty → Quick win</li>
  <li>Full accounting system integration: Medium importance, High difficulty → Later</li>
</ul>

<p><strong>Step 4: Writing User Stories</strong></p>

<p>Based on your analysis, you write:</p>

<blockquote>
  <p>As a small business owner who makes frequent purchases, I want expenses to be automatically categorized based on the merchant and amount so that I spend less mental energy on bookkeeping.</p>
</blockquote>

<p><strong>Step 5: Defining Acceptance Criteria</strong></p>

<blockquote>
  <p><strong>Given</strong> a user photographs a receipt from a merchant they have used before
<strong>When</strong> the app processes the image
<strong>Then</strong> it should auto-suggest the same category used previously with 90%+ confidence</p>

  <p><strong>Given</strong> a user photographs a receipt from a new merchant
<strong>When</strong> the app processes the image
<strong>Then</strong> it should suggest a category based on merchant type and purchase amount
<strong>And</strong> allow the user to confirm or change the category with one tap</p>

  <p><strong>Given</strong> a user overrides an auto-suggested category
<strong>When</strong> they encounter the same merchant again
<strong>Then</strong> the system should learn from the correction and suggest the new category</p>
</blockquote>

<p><strong>Step 6: Building</strong></p>

<p>At this point, you have everything you need to build the feature correctly. If you are coding by hand, you have clear requirements and testable criteria that you can put into a ticketing system (e.g. Jira, Trello, etc). If you are using AI tools, your prompt practically writes itself (but you should still use a ticketing system and create a feature branch based on that ticket, but I digress):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You are building an intelligent expense categorization feature. Here is the context:

User Story: As a small business owner who makes frequent purchases, I want 
expenses to be automatically categorized based on the merchant and amount 
so that I spend less mental energy on bookkeeping.

The core insight from user research: Users lose context over time and find 
manual categorization tedious. They want the system to learn their preferences.

Acceptance Criteria:
1. Given a user photographs a receipt from a merchant they have used before
   When the app processes the image
   Then it should auto-suggest the same category used previously with 90%+ confidence

2. Given a user photographs a receipt from a new merchant
   When the app processes the image
   Then it should suggest a category based on merchant type and purchase amount
   And allow the user to confirm or change the category with one tap

3. Given a user overrides an auto-suggested category
   When they encounter the same merchant again
   Then the system should learn from the correction and suggest the new category

Tech stack: React Native, TypeScript, Supabase, OpenAI API for OCR
Design constraints: Must work offline with sync, categorization UI must be 
completable in under 3 seconds

Please implement this feature including:
- Data model for storing merchant-category associations and user corrections
- Service layer for category prediction logic
- React Native components for the categorization UI
- Unit tests that verify each acceptance criterion
</code></pre></div></div>

<p>Compare that to “build me an expense categorization feature.” The difference is not subtle, and it does not matter whether a human or an AI is on the receiving end.</p>

<h2 id="qa-testing-what-actually-matters">QA: Testing What Actually Matters</h2>

<p>Here is something that keeps me up at night: AI can generate tests for the code it writes. Sounds great, right? The problem is that AI-generated tests often only cover the happy path. They test what the code does, not what it should do.</p>

<p>This is why acceptance criteria matter so much. They are not just documentation. They are your test specification. When you write:</p>

<blockquote>
  <p><strong>Given</strong> a user with no internet connection
<strong>When</strong> they try to save a task
<strong>Then</strong> the task should be queued locally and synced when connection is restored</p>
</blockquote>

<p>You have just written a test case! A test case that reflects real user needs that you discovered through the Looking and Understanding phases of design thinking.</p>

<p>The teams that are shipping quality software are the ones who:</p>

<ol>
  <li>Start with design thinking to understand the actual problem</li>
  <li>Use methods like Rose/Thorn/Bud, Affinity Clustering, and Importance/Difficulty to synthesize insights</li>
  <li>Write user stories with clear acceptance criteria</li>
  <li>Use those criteria to verify the implementation, however it was built</li>
  <li>Review the output against the original user story, not just “does it run”</li>
</ol>

<h2 id="metrics-that-actually-mean-something">Metrics That Actually Mean Something</h2>

<p>When you have clear user stories and acceptance criteria, measuring success becomes almost trivial. Did we meet the acceptance criteria? Yes or no. Does the feature solve the problem described in the user story? Yes or no.</p>

<p>Compare this to the alternative: “We shipped 47 features this quarter.” Great. Did any of them matter?</p>

<p>Some metrics that actually work:</p>

<ul>
  <li><strong>Acceptance Criteria Pass Rate</strong>: What percentage of your acceptance criteria are met on first deployment?</li>
  <li><strong>User Story Completion</strong>: Not “did we build it” but “did we solve the problem”</li>
  <li><strong>Iteration Count</strong>: How many times did you have to go back and refine? (Fewer is better and cheaper, and it usually means your original specification was clearer)</li>
  <li><strong>Post-Deploy Defects by Criteria Type</strong>: Where are the bugs? In the functionality? The edge cases? The performance? This tells you where your acceptance criteria need more rigor.</li>
</ul>

<h2 id="a-word-of-caution">A Word of Caution</h2>

<p>I am not saying design thinking and user stories will solve all your development problems. Complex systems still have complex bugs. Edge cases still slip through. Requirements still change mid-project.</p>

<p>But here is the thing: those problems are way easier to catch when you know what you were trying to build in the first place. When you have clear acceptance criteria, you can actually test whether the implementation is correct. When you have done the design thinking work, you can recognize when you are solving the wrong problem.</p>

<p>The developers I know who are thriving are not the ones who type the fastest or use the newest tools. They are the ones who got better at thinking clearly about what they are building.</p>

<h2 id="getting-started">Getting Started</h2>

<p>If you want to try this approach, here is what I would suggest:</p>

<ol>
  <li>
    <p><strong>Start with Rose/Thorn/Bud</strong>: Before you decide what to build, understand the landscape. Talk to users, review support tickets, observe behavior. Sort everything into roses (working well), thorns (pain points), and buds (opportunities). Do not skip this step. You cannot write good user stories about problems you do not understand.</p>
  </li>
  <li>
    <p><strong>Cluster your observations</strong>: Spread out your roses, thorns, and buds. Group the ones that seem related. Name the clusters. This is where you discover that five seemingly unrelated complaints are actually one underlying problem. These clusters become the foundation for your user stories.</p>
  </li>
  <li>
    <p><strong>Prioritize with Importance/Difficulty</strong>: Now you have clusters of insights. Plot them on the matrix. Which problems are high importance and low difficulty? Those are your starting points. The matrix does not tell you what features to build. It tells you which <em>problems</em> to solve first.</p>
  </li>
  <li>
    <p><strong>Let user stories emerge</strong>: For each prioritized problem, write a user story. The story should flow naturally from your research. “As a [user you interviewed], I want [solution to the thorn you identified] so that [benefit that addresses the underlying cluster].” If you did the design thinking work, the user stories should almost write themselves.</p>
  </li>
  <li>
    <p><strong>Define acceptance criteria</strong>: For each user story, write Given/When/Then statements that describe what success looks like. These come directly from your observations. What did users say they needed? What would make the thorn disappear?</p>
  </li>
  <li>
    <p><strong>Store the context in your codebase</strong>: Create a directory (something like <code class="language-plaintext highlighter-rouge">/docs/stories</code> or <code class="language-plaintext highlighter-rouge">/.context</code>) and add it to your <code class="language-plaintext highlighter-rouge">.gitignore</code>. Save your user stories and acceptance criteria as markdown or text files. When you are working with AI coding tools like Claude Code or Cursor, they can reference these files directly. This means you do not have to paste the same context into every prompt. The AI has access to the full picture: the user story, the acceptance criteria, the reasoning behind your decisions. Update these files as your understanding evolves. Even if you are not using AI tools, having this documentation in your repo keeps everyone aligned.</p>
  </li>
  <li>
    <p><strong>Build with confidence</strong>: With your context in place, you know exactly what you are building and how to verify it. Whether you are writing code by hand, pair programming, or using AI assistance, the foundation is solid.</p>
  </li>
</ol>

<p>Notice what is missing from this list: “Pick a feature.” You do not start by deciding what to build, you start by understanding what problems exist. The features reveal themselves through the process.</p>

<h2 id="the-bottom-line">The Bottom Line</h2>

<p>The way we write code is changing. AI tools are getting better every month. But the teams that will thrive are not the ones who learn to use the newest IDE or write the fanciest prompts. They are the ones who learn to think clearly about what they are building before they start.</p>

<p>This is product management. Whether your title says “PM” or not, the moment you start defining problems, prioritizing solutions, and writing acceptance criteria, you are doing product work. Design thinking gives you the methods. User stories give you the format. Acceptance criteria give you the tests. Together, they give you the ability to build software that actually solves problems.</p>

<p>The irony is that as building gets easier, the human skills of understanding users, synthesizing insights, and defining success become more valuable, not less. The sticky notes on the whiteboard are not obsolete. They are now the most important part of the process.</p>

<p>The revolution is not in the tools, it is in what you feed them, and feeding them well is a product management problem.</p>]]></content><author><name>Vatché</name></author><category term="AI" /><category term="Development" /><category term="Claude Code" /><category term="Product Management" /><category term="Design Thinking" /><category term="Claude Code" /><category term="Development &amp; DevOps" /><category term="Artificial Intelligence" /><category term="Workflow Optimization" /><category term="Product Management" /><summary type="html"><![CDATA[A deep dive into the intersection of design thinking and software development, exploring how product management and user-centered design improve outcomes whether you're using AI tools or writing code by hand.]]></summary></entry><entry><title type="html">The Multi-Agent Approach: How Claude Code’s Creator Actually Uses the Tool</title><link href="https://vatchechamlian.com/orchestrating-agents-claude.html" rel="alternate" type="text/html" title="The Multi-Agent Approach: How Claude Code’s Creator Actually Uses the Tool" /><published>2026-01-06T00:00:00+00:00</published><updated>2026-01-06T00:00:00+00:00</updated><id>https://vatchechamlian.com/orchestrating-agents-claude</id><content type="html" xml:base="https://vatchechamlian.com/orchestrating-agents-claude.html"><![CDATA[<p>When Boris Cherny, the creator of Claude Code, shared his personal workflow, it revealed something fascinating: the most powerful way to use AI coding assistants isn’t to replace your terminal with a single chatbot. Instead, it’s about orchestrating multiple AI agents working in parallel, each on their own task, while you 
context-switch between them like a conductor managing an orchestra.</p>

<p><a href="https://x.com/bcherny/status/2007179832300581177">Boris Cherny’s Post on X</a></p>

<p>After months of testing various AI coding platforms (as I detailed in my <a href="/vibe-coding-reviews.html">vibe coding review</a>), I’ve come to appreciate that the implementation details matter as much as the underlying AI model. Boris’s approach represents a fundamentally different philosophy from the “conversational app builder” platforms I tested. His focus is on production workflows rather than rapid prototyping.</p>

<h2 id="the-parallel-processing-paradigm">The Parallel Processing Paradigm</h2>

<h3 id="running-5-15-claudes-simultaneously">Running 5-15 Claudes Simultaneously</h3>

<p>Boris runs 5 Claude agents in his terminal (numbered tabs 1-5) plus 5-10 more on claude.ai/code, all working simultaneously. He uses iTerm2 system notifications to know when any agent needs input, allowing him to work on multiple features across different branches without waiting for any single agent to complete.</p>

<p>This isn’t just about speed. It’s about matching how developers actually think. While Claude works on implementing feature A, you can be planning feature B with another agent, reviewing feature C’s output, and testing feature D. The cognitive overhead of managing multiple agents is lower than you’d expect because each one maintains its own context and conversation history.</p>

<p><strong>Key setup elements:</strong></p>
<ul>
  <li>iTerm2 configured for system notifications when agents need input</li>
  <li>Numbered terminal tabs (1-5) for quick identification</li>
  <li>Browser sessions on claude.ai/code for additional parallelism</li>
  <li>Mobile app sessions kicked off throughout the day for background work</li>
  <li><code class="language-plaintext highlighter-rouge">--teleport</code> command for moving sessions between terminal and web</li>
</ul>

<p>The mobile workflow is particularly clever: start a few complex tasks from your phone in the morning, let them run while you commute or grab coffee, then check results when you’re at your desk.</p>

<h2 id="the-claudemd-strategy-team-knowledge-management">The CLAUDE.md Strategy: Team Knowledge Management</h2>

<p>One of the most underappreciated features Boris highlights is the shared <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> file, which is a repository-level instruction set that the entire team maintains and Claude reads before every interaction.</p>

<h3 id="how-it-works">How It Works</h3>

<p>Every time Claude makes a mistake or the team establishes a new pattern, it gets added to <code class="language-plaintext highlighter-rouge">CLAUDE.md</code>. Here’s a real example from the Claude Code repository:</p>

<p>Note: Bun is a fast JavaScript runtime and package manager, similar to Node.js and npm but significantly faster. The Claude Code team has standardized on it for their workflow.</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Development Workflow</span>

<span class="gs">**Always use `bun`, not `npm`.**</span>

<span class="gh"># 1. Make changes</span>

<span class="gh"># 2. Typecheck (fast)</span>
bun run typecheck

<span class="gh"># 3. Run tests</span>
bun run test -- -t "test name"      # Single suite
bun run test:file -- "glob"         # Specific files

<span class="gh"># 4. Lint before committing</span>
bun run lint:file -- "file1.ts"     # Specific files
bun run lint                         # All files

<span class="gh"># 5. Before creating PR</span>
bun run lint:claude &amp;&amp; bun run test
</code></pre></div></div>

<p>This creates a compounding knowledge base. Instead of repeatedly telling Claude “don’t use enums, use string literal unions,” you add it once to <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> and it applies to all future interactions for all team members.</p>

<h3 id="github-integration-living-documentation">GitHub Integration: Living Documentation</h3>

<p>The Claude Code team uses the Claude Code GitHub Action to update <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> directly from code reviews. This means improvements to the knowledge base happen as a natural part of the development workflow.</p>

<p><strong>How it works:</strong></p>

<p>During code review, you can tag the Claude bot:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@.claude add to CLAUDE.md to never use enums, always prefer literal unions
</code></pre></div></div>

<p>The bot responds with a plan, makes the change to CLAUDE.md, and commits it. This is what Dan Shipper calls “Compounding Engineering.” Each code review makes the system slightly smarter for everyone on the team.</p>

<p><strong>To set this up:</strong></p>

<ol>
  <li>Install the Claude Code GitHub Action from the GitHub Marketplace</li>
  <li>Configure it with your repository</li>
  <li>Grant it write access to your repo</li>
  <li>Start tagging @.claude in your code reviews</li>
</ol>

<p>Within a few weeks, your <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> will evolve from a basic template into a comprehensive guide that captures your team’s collective knowledge.</p>

<h2 id="plan-mode-the-secret-to-one-shot-success">Plan Mode: The Secret to One-Shot Success</h2>

<p>Boris’s sessions typically start in Plan mode (triggered with <code class="language-plaintext highlighter-rouge">shift+tab</code> twice). This is crucial: instead of letting Claude jump straight to implementation, he iterates on the <em>approach</em> first:</p>

<ol>
  <li><strong>Describe the goal</strong> in Plan mode</li>
  <li><strong>Refine the plan</strong> through back-and-forth until it’s solid</li>
  <li><strong>Switch to auto-accept edits</strong> mode</li>
  <li><strong>Let Claude implement</strong> (usually succeeds in one shot)</li>
</ol>

<p>This mirrors my recommendation from the vibe coding article about establishing approach before implementation, but Boris takes it further by using a dedicated mode rather than just prompting for it.</p>

<p>The time investment in planning pays off exponentially. A good plan lets Claude work autonomously without breaking existing functionality—the exact failure mode I encountered repeatedly with vibe coding platforms.</p>

<h2 id="slash-commands-automating-inner-loops">Slash Commands: Automating Inner Loops</h2>

<p>Slash commands save you from repeated prompting and make it possible for Claude to use your workflows too. Boris creates custom slash commands for every “inner loop” workflow that he does many times a day. What Boris is referring to as “slash commands” are actually what Claude Code calls “skills.”</p>

<p>If you don’t know how to create skills, check out my <a href="/claude-code-skills-guide.html">guide</a>.</p>

<p><strong>What are inner loop workflows?</strong></p>

<p>These are the small, repetitive tasks you do constantly during development:</p>
<ul>
  <li>Running tests</li>
  <li>Committing and pushing code</li>
  <li>Checking type errors</li>
  <li>Running the linter</li>
  <li>Deploying to staging</li>
</ul>

<p>Commands are stored in <code class="language-plaintext highlighter-rouge">.claude/commands/</code> and checked into git so the whole team can use them.</p>

<p><strong>Example: The <code class="language-plaintext highlighter-rouge">/commit-push-pr</code> command</strong></p>

<p>This is Boris’s most-used command, running dozens of times per day:</p>

<p>Create <code class="language-plaintext highlighter-rouge">.claude/commands/commit-push-pr.sh</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Commits changes, pushes to remote, and creates a PR</span>

<span class="nb">set</span> <span class="nt">-e</span>  <span class="c"># Exit on any error</span>

<span class="nb">echo</span> <span class="s2">"📝 Preparing to commit and push..."</span>

<span class="c"># Pre-compute git status to avoid back-and-forth with Claude</span>
<span class="nv">MODIFIED_FILES</span><span class="o">=</span><span class="si">$(</span>git diff <span class="nt">--name-only</span><span class="si">)</span>
<span class="nv">BRANCH</span><span class="o">=</span><span class="si">$(</span>git branch <span class="nt">--show-current</span><span class="si">)</span>
<span class="nv">UPSTREAM_BRANCH</span><span class="o">=</span><span class="si">$(</span>git rev-parse <span class="nt">--abbrev-ref</span> <span class="nt">--symbolic-full-name</span> @<span class="o">{</span>u<span class="o">}</span> 2&gt;/dev/null <span class="o">||</span> <span class="nb">echo</span> <span class="s2">""</span><span class="si">)</span>

<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$MODIFIED_FILES</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"❌ No changes to commit"</span>
    <span class="nb">exit </span>1
<span class="k">fi

</span><span class="nb">echo</span> <span class="s2">"Modified files:"</span>
<span class="nb">echo</span> <span class="s2">"</span><span class="nv">$MODIFIED_FILES</span><span class="s2">"</span>
<span class="nb">echo</span> <span class="s2">""</span>

<span class="c"># Get commit message from Claude or user</span>
<span class="nb">read</span> <span class="nt">-p</span> <span class="s2">"Commit message: "</span> COMMIT_MSG

<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$COMMIT_MSG</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"❌ Commit message required"</span>
    <span class="nb">exit </span>1
<span class="k">fi</span>

<span class="c"># Commit changes</span>
git add <span class="nt">-A</span>
git commit <span class="nt">-m</span> <span class="s2">"</span><span class="nv">$COMMIT_MSG</span><span class="s2">"</span>

<span class="nb">echo</span> <span class="s2">"✅ Committed changes"</span>

<span class="c"># Push to remote</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$UPSTREAM_BRANCH</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span>git push <span class="nt">-u</span> origin <span class="s2">"</span><span class="nv">$BRANCH</span><span class="s2">"</span>
<span class="k">else
    </span>git push
<span class="k">fi

</span><span class="nb">echo</span> <span class="s2">"✅ Pushed to remote"</span>

<span class="c"># Create PR if gh CLI is available</span>
<span class="k">if </span><span class="nb">command</span> <span class="nt">-v</span> gh &amp;&gt; /dev/null<span class="p">;</span> <span class="k">then
    </span><span class="nb">read</span> <span class="nt">-p</span> <span class="s2">"Create PR? (y/n): "</span> CREATE_PR
    <span class="k">if</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$CREATE_PR</span><span class="s2">"</span> <span class="o">=</span> <span class="s2">"y"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
        </span>gh <span class="nb">pr </span>create <span class="nt">--fill</span>
        <span class="nb">echo</span> <span class="s2">"✅ PR created"</span>
    <span class="k">fi
fi</span>
</code></pre></div></div>

<p>Make it executable:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">chmod</span> +x .claude/commands/commit-push-pr.sh
</code></pre></div></div>

<p><strong>Why inline bash matters:</strong></p>

<p>Notice the command pre-computes git status at the start:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">MODIFIED_FILES</span><span class="o">=</span><span class="si">$(</span>git diff <span class="nt">--name-only</span><span class="si">)</span>
<span class="nv">BRANCH</span><span class="o">=</span><span class="si">$(</span>git branch <span class="nt">--show-current</span><span class="si">)</span>
</code></pre></div></div>

<p>This information is gathered once and available throughout the script. Without this, Claude would need to:</p>
<ol>
  <li>Run <code class="language-plaintext highlighter-rouge">git status</code></li>
  <li>Wait for response</li>
  <li>Ask what files to commit</li>
  <li>Wait for response</li>
  <li>Run <code class="language-plaintext highlighter-rouge">git commit</code></li>
  <li>And so on…</li>
</ol>

<p>By pre-computing everything, the command runs quickly without back-and-forth.</p>

<p><strong>More useful commands to create:</strong></p>

<p><code class="language-plaintext highlighter-rouge">.claude/commands/test-focused.sh</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Run tests for files that changed</span>

<span class="nv">CHANGED_FILES</span><span class="o">=</span><span class="si">$(</span>git diff <span class="nt">--name-only</span> | <span class="nb">grep</span> <span class="nt">-E</span> <span class="s1">'\.(ts|js|tsx|jsx)$'</span><span class="si">)</span>

<span class="k">for </span>file <span class="k">in</span> <span class="nv">$CHANGED_FILES</span><span class="p">;</span> <span class="k">do</span>
    <span class="c"># Convert source file to test file path</span>
    <span class="nv">TEST_FILE</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> <span class="nv">$file</span> | <span class="nb">sed</span> <span class="s1">'s/src/src/__tests__/; s/\.tsx\?/.test.ts/'</span><span class="si">)</span>
    
    <span class="k">if</span> <span class="o">[</span> <span class="nt">-f</span> <span class="s2">"</span><span class="nv">$TEST_FILE</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
        </span><span class="nb">echo</span> <span class="s2">"Testing </span><span class="nv">$TEST_FILE</span><span class="s2">..."</span>
        npm <span class="nb">test</span> <span class="nt">--</span> <span class="s2">"</span><span class="nv">$TEST_FILE</span><span class="s2">"</span>
    <span class="k">fi
done</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">.claude/commands/quick-check.sh</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Fast checks before committing</span>

<span class="nb">echo</span> <span class="s2">"Running quick checks..."</span>

<span class="c"># Type check</span>
<span class="nb">echo</span> <span class="s2">"1/3 Type checking..."</span>
npm run typecheck

<span class="c"># Lint</span>
<span class="nb">echo</span> <span class="s2">"2/3 Linting..."</span>  
npm run lint

<span class="c"># Quick tests (not full suite)</span>
<span class="nb">echo</span> <span class="s2">"3/3 Running changed file tests..."</span>
.claude/commands/test-focused.sh

<span class="nb">echo</span> <span class="s2">"✅ All checks passed!"</span>
</code></pre></div></div>

<p>Now Claude can run <code class="language-plaintext highlighter-rouge">/quick-check</code> before committing, or you can run it manually. The key is identifying your most common workflows and automating them.</p>

<h2 id="subagents-specialized-ai-workers">Subagents: Specialized AI Workers</h2>

<p>Subagents are like specialized team members with specific expertise. Rather than asking the generalist Claude to both implement code AND simplify it, you hand off to a specialist agent once implementation is done.</p>

<p>Boris maintains several subagents in <code class="language-plaintext highlighter-rouge">.claude/agents/</code> for common post-processing tasks:</p>

<ul>
  <li><strong>code-simplifier</strong>: Refactors Claude’s output after implementation</li>
  <li><strong>verify-app</strong>: Runs comprehensive end-to-end tests on Claude Code itself</li>
  <li><strong>build-validator</strong>: Checks build integrity</li>
  <li><strong>code-architect</strong>: Reviews large changes for architectural consistency</li>
</ul>

<p><strong>How to create your first subagent:</strong></p>

<p>Create a file <code class="language-plaintext highlighter-rouge">.claude/agents/code-simplifier.md</code>:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Code Simplifier Agent</span>

You are a specialist in code simplification and refactoring.
Your job is to take working code and make it more maintainable 
without changing behavior.

<span class="gu">## Your responsibilities:</span>
<span class="p">1.</span> Remove duplicate code
<span class="p">2.</span> Extract complex logic into well-named functions
<span class="p">3.</span> Simplify conditional statements
<span class="p">4.</span> Improve variable and function names
<span class="p">5.</span> Add helpful comments for complex logic

<span class="gu">## What you should NOT do:</span>
<span class="p">-</span> Do not change functionality
<span class="p">-</span> Do not add new features
<span class="p">-</span> Do not remove tests
<span class="p">-</span> Do not modify public APIs

<span class="gu">## Process:</span>
<span class="p">1.</span> Read the files that were just modified
<span class="p">2.</span> Identify simplification opportunities
<span class="p">3.</span> Make changes one file at a time
<span class="p">4.</span> Run tests after each change
<span class="p">5.</span> Report what was simplified
</code></pre></div></div>

<p><strong>Using subagents:</strong></p>

<p>After Claude implements a feature, you can hand off to a subagent:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You: "Great! Now invoke the code-simplifier agent to clean this up"
Claude: [Calls code-simplifier subagent]
Code Simplifier: [Reviews and refactors the code]
</code></pre></div></div>

<p>Start with one or two subagents for your most common post-processing tasks, then add more as you identify patterns.</p>

<h2 id="the-verification-loop-2-3x-quality-improvement">The Verification Loop: 2-3x Quality Improvement</h2>

<p>Boris emphasizes that the single most important factor for quality is <strong>giving Claude a way to verify its work</strong>. Without this feedback loop, Claude can’t iterate to fix problems.</p>

<p>For Claude Code itself, Boris uses the Claude Chrome extension to:</p>
<ol>
  <li>Open a browser</li>
  <li>Test the UI</li>
  <li>Iterate until code works and UX feels good</li>
</ol>

<p>This automated testing happens for <em>every</em> change landed to claude.ai/code. The extension can click through interfaces, verify visual elements, and report issues back to Claude for fixes.</p>

<p><strong>Verification looks different per domain:</strong></p>
<ul>
  <li><strong>CLI tools</strong>: Run the tool and verify output</li>
  <li><strong>Web apps</strong>: Browser automation testing</li>
  <li><strong>Mobile apps</strong>: Simulator testing</li>
  <li><strong>APIs</strong>: Automated integration tests</li>
  <li><strong>Data pipelines</strong>: Sample data validation</li>
</ul>

<p>The principle is universal: create a fast, reliable way for Claude to check its own work, and quality will dramatically improve.</p>

<h2 id="hooks-automated-code-quality">Hooks: Automated Code Quality</h2>

<p>Hooks in Claude Code let you run commands automatically at specific points in the workflow. Boris uses PostToolUse hooks to automatically format code after Claude makes changes.</p>

<p><strong>What are hooks?</strong></p>

<p>Think of hooks as automatic quality checks that run without you having to remember them. They’re triggered by specific events in Claude’s workflow:</p>

<ul>
  <li><strong>PostToolUse</strong>: Runs after Claude uses a tool (like editing a file)</li>
  <li><strong>PreToolUse</strong>: Runs before Claude uses a tool</li>
  <li><strong>Stop</strong>: Runs when a long task completes</li>
  <li><strong>Error</strong>: Runs when something goes wrong</li>
</ul>

<p><strong>Setting up your first hook:</strong></p>

<p>Create or edit <code class="language-plaintext highlighter-rouge">.claude/settings.json</code> in your project:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"PostToolUse"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"matcher"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Write|Edit"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"command"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"npm run format || true"</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>This configuration does several things:</p>
<ul>
  <li><strong>matcher</strong>: Triggers only when Claude writes or edits files (not when reading)</li>
  <li><strong>command</strong>: Runs your code formatter</li>
  <li><strong>|| true</strong>: Ensures the hook doesn’t fail if formatting has warnings</li>
</ul>

<p>If you’re using Bun (like the Claude Code team), it would look like:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"PostToolUse"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"matcher"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Write|Edit"</span><span class="p">,</span><span class="w"> 
      </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"command"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"bun run format || true"</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><strong>Why this matters:</strong></p>

<p>Without this hook, you’d need to:</p>
<ol>
  <li>Let Claude make changes</li>
  <li>Remember to run the formatter</li>
  <li>Commit the formatting changes separately</li>
  <li>Or worse, have CI fail because of formatting issues</li>
</ol>

<p>With the hook, formatting happens automatically after every file change. Claude’s code is already formatted by the time you review it.</p>

<p><strong>Other useful hooks:</strong></p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"PostToolUse"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"matcher"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Write|Edit"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"command"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"npm run format || true"</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"command"</span><span class="p">,</span><span class="w"> 
          </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"npm run lint:fix || true"</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">],</span><span class="w">
  </span><span class="nl">"Stop"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"matcher"</span><span class="p">:</span><span class="w"> </span><span class="s2">"*"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"command"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">".claude/commands/verify-app.sh"</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>This setup:</p>
<ol>
  <li>Formats and lints code after every edit</li>
  <li>Runs comprehensive verification when tasks complete</li>
</ol>

<p>Start with just the formatting hook, then add more as you identify patterns.</p>

<h2 id="permission-management-security-without-friction">Permission Management: Security Without Friction</h2>

<p>Rather than using <code class="language-plaintext highlighter-rouge">--dangerously-skip-permissions</code>, Boris pre-allows safe commands through <code class="language-plaintext highlighter-rouge">/permissions</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Bash(bq query:*)
Bash(bun run build:*)
Bash(bun run lint:file:*)
Bash(bun run test:*)
Bash(bun run test:file:*)
Bash(bun run typecheck:*)
Bash(test:*)
Bash(cc:*)
Bash(comm:*)
Bash(find:*)
</code></pre></div></div>

<p>These permissions are stored in <code class="language-plaintext highlighter-rouge">.claude/settings.json</code> and shared with the team. It’s a middle ground: Claude can work autonomously on common operations while still requiring confirmation for potentially dangerous commands.</p>

<p>For sandbox environments or very long-running tasks, Boris will occasionally use <code class="language-plaintext highlighter-rouge">--permission-mode=dontAsk</code> to let Claude “cook without being blocked,” but this is reserved for isolated contexts.</p>

<h2 id="mcp-integration-external-tool-access">MCP Integration: External Tool Access</h2>

<p>Claude Code uses MCP (Model Context Protocol) to interact with Boris’s entire tool ecosystem:</p>

<ul>
  <li><strong>Slack</strong>: Search conversations, post updates</li>
  <li><strong>BigQuery</strong>: Run analytics queries via <code class="language-plaintext highlighter-rouge">bq</code> CLI</li>
  <li><strong>Sentry</strong>: Pull error logs automatically</li>
  <li><strong>Custom internal tools</strong>: Anything with a CLI interface</li>
</ul>

<p>The Slack MCP configuration is checked into <code class="language-plaintext highlighter-rouge">.mcp.json</code> and shared:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"mcpServers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"slack"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://slack.mcp.anthropic.com/mcp"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>This means Claude can autonomously:</p>
<ul>
  <li>Search for relevant Slack discussions when debugging</li>
  <li>Post status updates to team channels</li>
  <li>Query production analytics to verify changes</li>
  <li>Pull error logs to understand user issues</li>
</ul>

<p>The friction of context-switching between tools disappears when Claude can access them directly.</p>
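<p>You don’t have to write <code class="language-plaintext highlighter-rouge">.mcp.json</code> by hand. Assuming a current Claude Code install, the <code class="language-plaintext highlighter-rouge">claude mcp</code> subcommands can register the server and write it to the shared project config for you:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Register the remote (HTTP) Slack server in the shared project config (.mcp.json)
claude mcp add --scope project --transport http slack https://slack.mcp.anthropic.com/mcp

# Confirm the server is registered and reachable
claude mcp list
</code></pre></div></div>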

<h2 id="the-ralph-wiggum-plugin-long-running-task-safety">The Ralph Wiggum Plugin: Long-Running Task Safety</h2>

<p>For very long-running tasks (deployments, migrations, extensive refactors), Boris uses the “ralph-wiggum” plugin. This plugin was originally created by Geoffrey Huntley and implements a verification step when tasks complete.</p>

<p>The plugin is named after a character from The Simpsons who is famous for enthusiastically declaring “I’m helping!” while unknowingly making situations worse. The name is perfect because it captures a real risk with AI: agents that work unsupervised for hours might produce code that seems complete but actually breaks things.</p>

<p>The ralph-wiggum plugin ensures that when Claude finishes a multi-hour task unsupervised, the results are actually correct before merging. It runs a comprehensive verification suite and can even alert you if something seems off.</p>

<p>Combined with Stop hooks, this creates a safety net: long tasks can run overnight or while you’re in meetings, with automatic verification before results are committed.</p>

<h2 id="the-model-choice-opus-45-with-extended-thinking">The Model Choice: Opus 4.5 with Extended Thinking</h2>

<p>Boris runs Opus 4.5 with extended thinking enabled for everything, even though it’s slower than Sonnet. His reasoning might surprise you:</p>

<blockquote>
  <p>“You have to steer it less and it’s better at tool use, so it is almost always faster than using a smaller model in the end.”</p>
</blockquote>

<p>This contradicts conventional wisdom about using faster models for simple tasks. But Boris’s experience shows that the end-to-end time from prompt to working code is actually lower with Opus because:</p>

<ol>
  <li><strong>Fewer correction cycles:</strong> Opus gets it right more often on the first try</li>
  <li><strong>Better tool use:</strong> Less back-and-forth when running commands</li>
  <li><strong>Deeper understanding:</strong> Handles complex refactors that would require multiple iterations with smaller models</li>
</ol>

<p>The “extended thinking” mode builds on the model’s chain-of-thought reasoning, letting it work through problems more thoroughly before responding. You’ll see Claude’s reasoning process in real time, which helps you understand its approach and catch potential issues early.</p>

<p>Think of it this way: a slower, more capable model that succeeds in one attempt is faster than a quick model that requires three rounds of corrections.</p>
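<p>If you want to try this yourself, model selection is a one-flag change. A small sketch (the environment variable and model ID shown are examples; check your install’s docs for the current names):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Start a session on Opus explicitly instead of the default model
claude --model opus

# Or set it once for your shell
export ANTHROPIC_MODEL=claude-opus-4-5
</code></pre></div></div>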

<h2 id="multi-branch-parallelism-avoiding-conflicts">Multi-Branch Parallelism: Avoiding Conflicts</h2>

<p>A critical detail that Boris mentions: when running multiple agents on the same codebase, each agent works on its own feature branch in its own repository clone.</p>

<p><strong>Why this approach matters:</strong></p>

<p>When you run 3-5 agents simultaneously all making changes to the main branch, you’re going to hit merge conflicts constantly. It becomes a mess of competing edits where Agent 1’s changes conflict with Agent 2’s changes, and you spend more time resolving conflicts than actually developing.</p>

<p>Instead, here’s the workflow Boris uses:</p>

<p><strong>Setting up isolated workspaces:</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Create a directory for your parallel work</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> ~/code/myproject-parallel

<span class="c"># Clone your repo multiple times</span>
<span class="nb">cd</span> ~/code/myproject-parallel
git clone git@github.com:yourorg/myproject.git agent1
git clone git@github.com:yourorg/myproject.git agent2  
git clone git@github.com:yourorg/myproject.git agent3

<span class="c"># In each clone, create a feature branch</span>
<span class="nb">cd </span>agent1 <span class="o">&amp;&amp;</span> git checkout <span class="nt">-b</span> feature/authentication
<span class="nb">cd</span> ../agent2 <span class="o">&amp;&amp;</span> git checkout <span class="nt">-b</span> feature/database-migration
<span class="nb">cd</span> ../agent3 <span class="o">&amp;&amp;</span> git checkout <span class="nt">-b</span> docs/api-updates
</code></pre></div></div>

<p>Now each agent has:</p>
<ul>
  <li>Its own working directory</li>
  <li>Its own feature branch</li>
  <li>Complete isolation from other agents</li>
  <li>Full context for its specific task</li>
</ul>

<p><strong>Benefits of this approach:</strong></p>
<ul>
  <li>PRs remain clean and focused</li>
  <li>Merge conflicts are rare (each branch diverges from main separately)</li>
  <li>Each agent has complete context for its branch</li>
  <li>You can abandon failed experiments without affecting other work</li>
  <li>Code reviews are clearer because each PR has a single purpose</li>
</ul>

<p><strong>Managing the workflow:</strong></p>

<p>Open a terminal tab for each agent workspace:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Tab 1: Authentication feature</span>
<span class="nb">cd</span> ~/code/myproject-parallel/agent1
claude

<span class="c"># Tab 2: Database migration</span>
<span class="nb">cd</span> ~/code/myproject-parallel/agent2
claude

<span class="c"># Tab 3: Documentation updates</span>
<span class="nb">cd</span> ~/code/myproject-parallel/agent3
claude
</code></pre></div></div>

<p>This maps to the cognitive model of working on multiple features. Each lives in its own mental and physical space until it’s ready to merge back to main.</p>
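<p>If full clones feel heavy, <code class="language-plaintext highlighter-rouge">git worktree</code> gives you the same per-branch isolation from a single repository. This is an alternative to the clone-based setup described above, not part of it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># One repository, several working trees: same isolation, less duplication
cd ~/code/myproject
git worktree add -b feature/authentication ../myproject-auth
git worktree add -b feature/database-migration ../myproject-db

# Each directory sits on its own branch but shares a single .git store
git worktree list
</code></pre></div></div>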

<h2 id="comparing-approaches-vibe-coding-vs-claude-code">Comparing Approaches: Vibe Coding vs. Claude Code</h2>

<p>Having tested both paradigms extensively, the difference is clear:</p>

<p><strong>Vibe Coding Platforms</strong> (Replit, Tempo, Lovable, etc.):</p>
<ul>
  <li>Optimized for rapid prototyping and demos</li>
  <li>Single-agent, conversational interface</li>
  <li>Struggle with iterative refinement</li>
  <li>Break down when adding features to existing code</li>
  <li>Best for initial generation</li>
</ul>

<p><strong>Claude Code</strong> (Boris’s Workflow):</p>
<ul>
  <li>Optimized for production development</li>
  <li>Multi-agent parallel processing</li>
  <li>Designed for iterative improvement</li>
  <li>Handles complex refactors through planning and verification</li>
  <li>Best for real applications</li>
</ul>

<p>The vibe coding platforms excel at the first 20% of development, which is getting a working prototype fast. Claude Code, used properly, excels at the remaining 80%, which includes refining, extending, and maintaining real applications over time.</p>

<h2 id="lessons-for-your-workflow">Lessons for Your Workflow</h2>

<p>Even if you’re not ready to run 15 parallel Claude agents, several principles apply immediately:</p>

<h3 id="1-document-patterns-in-claudemd">1. Document Patterns in CLAUDE.md</h3>

<p>Start with a simple file:</p>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Project Guidelines</span>

<span class="gu">## Tech Stack</span>
<span class="p">-</span> Use TypeScript strict mode
<span class="p">-</span> Prefer functional components in React
<span class="p">-</span> Use Tailwind for styling

<span class="gu">## Testing</span>
<span class="p">-</span> Write tests before implementation
<span class="p">-</span> Use Jest for unit tests
<span class="p">-</span> Use Playwright for E2E tests

<span class="gu">## Common Mistakes to Avoid</span>
<span class="p">-</span> Don't use <span class="sb">`any`</span> type
<span class="p">-</span> Don't commit console.logs
<span class="p">-</span> Don't skip error handling
</code></pre></div></div>

<p>Update it whenever Claude makes a mistake or you establish a new pattern.</p>

<h3 id="2-use-plan-mode-for-complex-tasks">2. Use Plan Mode for Complex Tasks</h3>

<p>Before implementing anything non-trivial:</p>
<ol>
  <li>Switch to Plan mode</li>
  <li>Describe your goal</li>
  <li>Iterate on the approach</li>
  <li>Only then switch to implementation</li>
</ol>

<p>This single change will dramatically improve your success rate.</p>

<h3 id="3-create-slash-commands-for-repetitive-workflows">3. Create Slash Commands for Repetitive Workflows</h3>

<p>Identify the 3-5 workflows you do most often:</p>
<ul>
  <li>Running tests</li>
  <li>Committing and pushing</li>
  <li>Building and deploying</li>
  <li>Generating documentation</li>
  <li>Running type checks</li>
</ul>

<p>Create slash commands for these and share them with your team.</p>

<h3 id="4-build-verification-loops">4. Build Verification Loops</h3>

<p>For every project, invest in making verification fast and automated:</p>
<ul>
  <li>Set up hot-reload for web apps</li>
  <li>Create test scripts that run in &lt;10 seconds</li>
  <li>Build sample data generators for testing</li>
  <li>Set up automated E2E tests for critical paths</li>
</ul>

<p>Then give Claude access to run these verifications.</p>

<h3 id="5-start-with-2-3-parallel-agents">5. Start with 2-3 Parallel Agents</h3>

<p>You don’t need 15 agents on day one. Start with 2-3:</p>
<ul>
  <li><strong>Agent 1</strong>: Feature implementation</li>
  <li><strong>Agent 2</strong>: Test writing</li>
  <li><strong>Agent 3</strong>: Documentation updates</li>
</ul>

<p>Even this modest parallelism will change how you work.</p>

<h2 id="the-future-of-ai-assisted-development">The Future of AI-Assisted Development</h2>

<p>Boris’s workflow represents a mature approach to AI-assisted development. It’s not about replacing developers or eliminating coding—it’s about orchestrating AI agents as force multipliers.</p>

<p>The developers who will thrive aren’t those who can prompt the hardest, but those who can:</p>
<ul>
  <li>Architect systems that AI agents can navigate</li>
  <li>Create verification loops that enable autonomy</li>
  <li>Build knowledge bases (like <code class="language-plaintext highlighter-rouge">CLAUDE.md</code>) that compound over time</li>
  <li>Manage parallel workstreams effectively</li>
  <li>Integrate AI into existing tool ecosystems</li>
</ul>

<p>This echoes my conclusion from <a href="/coding-beyond-ai.html">Coding Beyond AI</a>: the future belongs to those who can architect, direct, and orchestrate AI tools, not those who simply use them as fancy autocomplete.</p>

<h2 id="getting-started-your-path-to-multi-agent-development">Getting Started: Your Path to Multi-Agent Development</h2>

<p>If you’re using Claude Code (or any AI coding assistant), here’s how to progressively build up to Boris’s workflow. Each step builds on the previous one, and you should only move to the next step once you’re comfortable with the current one.</p>

<h3 id="step-1-create-your-claudemd-file">Step 1: Create Your CLAUDE.md File</h3>

<p>Start by creating a file called <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> in your project’s root directory. This will be Claude’s instruction manual for your codebase.</p>

<p><strong>Initial template to get started:</strong></p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Project Guidelines</span>

<span class="gu">## Tech Stack</span>
<span class="p">-</span> Primary language: [Your language here]
<span class="p">-</span> Framework: [Your framework]
<span class="p">-</span> Package manager: [npm, yarn, bun, etc.]

<span class="gu">## Testing</span>
<span class="p">-</span> Test framework: [Jest, Vitest, etc.]
<span class="p">-</span> Command to run tests: [your command]

<span class="gu">## Common Commands</span>
<span class="p">-</span> Start dev server: [your command]
<span class="p">-</span> Build for production: [your command]
<span class="p">-</span> Run linter: [your command]

<span class="gu">## Common Mistakes to Avoid</span>
(Start empty - you'll add to this as you go)
</code></pre></div></div>

<p>As you work with Claude, anytime it makes a mistake or you establish a new pattern, add it to the “Common Mistakes to Avoid” section. For example:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gu">## Common Mistakes to Avoid</span>
<span class="p">-</span> Don't use <span class="sb">`any`</span> type in TypeScript
<span class="p">-</span> Don't commit console.logs to production code
<span class="p">-</span> Always include error handling in async functions
<span class="p">-</span> Prefer functional components over class components in React
</code></pre></div></div>

<p>This file compounds in value over time. After a month of updates, Claude will know your codebase’s quirks better than most new team members.</p>

<h3 id="step-2-learn-plan-mode-for-complex-tasks">Step 2: Learn Plan Mode for Complex Tasks</h3>

<p>Plan mode is accessed by pressing <code class="language-plaintext highlighter-rouge">Shift+Tab</code> twice in Claude Code. It changes how Claude approaches your request.</p>

<p><strong>Without Plan mode:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You: "Add authentication to the app"
Claude: [Immediately starts writing code]
</code></pre></div></div>

<p><strong>With Plan mode:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You: "Add authentication to the app"
Claude: [Provides a detailed plan]
  1. Set up authentication provider (e.g., Auth0, Supabase)
  2. Create login/signup components
  3. Add protected route wrapper
  4. Implement token storage
  5. Add logout functionality
  
You: "Looks good, but let's use session storage instead of local storage"
Claude: [Updates plan]
You: "Perfect, proceed with implementation"
Claude: [Implements the updated plan]
</code></pre></div></div>

<p><strong>When to use Plan mode:</strong></p>
<ul>
  <li>Implementing new features</li>
  <li>Large refactors</li>
  <li>Anything that touches multiple files</li>
  <li>When you’re not 100% sure of the approach</li>
</ul>

<p><strong>When to skip Plan mode:</strong></p>
<ul>
  <li>Fixing typos or simple bugs</li>
  <li>Updating documentation</li>
  <li>Making obvious, small changes</li>
</ul>

<p>Start using Plan mode for any task that would take more than 5 minutes to implement manually.</p>

<h3 id="step-3-set-up-your-first-slash-command">Step 3: Set Up Your First Slash Command</h3>

<p>Slash commands automate repetitive workflows. Let’s create a simple one for running your test suite.</p>

<p>Create a directory structure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>your-project/
  .claude/
    commands/
      test.sh
</code></pre></div></div>

<p><strong>Example test.sh:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Runs the test suite with common options</span>

<span class="nb">echo</span> <span class="s2">"Running test suite..."</span>

<span class="c"># Run tests with coverage</span>
npm <span class="nb">test</span> <span class="nt">--</span> <span class="nt">--coverage</span> <span class="nt">--watchAll</span><span class="o">=</span><span class="nb">false</span>

<span class="c"># Check if tests passed</span>
<span class="k">if</span> <span class="o">[</span> <span class="nv">$?</span> <span class="nt">-eq</span> 0 <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"✅ All tests passed!"</span>
<span class="k">else
    </span><span class="nb">echo</span> <span class="s2">"❌ Tests failed. Review output above."</span>
    <span class="nb">exit </span>1
<span class="k">fi</span>
</code></pre></div></div>

<p>Make it executable:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">chmod</span> +x .claude/commands/test.sh
</code></pre></div></div>

<p>Now when you type <code class="language-plaintext highlighter-rouge">/test</code> in Claude Code, it will run this script. You can create commands for:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">/commit-push</code> - Commits and pushes changes</li>
  <li><code class="language-plaintext highlighter-rouge">/deploy</code> - Deploys to staging</li>
  <li><code class="language-plaintext highlighter-rouge">/format</code> - Runs code formatter</li>
  <li><code class="language-plaintext highlighter-rouge">/typecheck</code> - Runs TypeScript type checking</li>
</ul>

<p>The key is identifying the workflows you do most often and automating them. Start with just one or two commands, then add more as you identify patterns.</p>
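<p>For instance, a <code class="language-plaintext highlighter-rouge">/typecheck</code> command from that list might look like this (a hypothetical sketch assuming a TypeScript project with <code class="language-plaintext highlighter-rouge">tsc</code> available):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
# .claude/commands/typecheck.sh
# Runs the TypeScript compiler in check-only mode

echo "Running TypeScript type check..."

if npx tsc --noEmit; then
    echo "✅ No type errors"
else
    echo "❌ Type errors found. See output above."
    exit 1
fi
</code></pre></div></div>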

<h3 id="step-4-build-your-first-verification-loop">Step 4: Build Your First Verification Loop</h3>

<p>Verification loops let Claude check its own work, which Boris identified as the single most important factor for quality.</p>

<p><strong>For a web application:</strong></p>

<p>Create a simple test script that Claude can run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># .claude/commands/verify-app.sh</span>

<span class="nb">echo</span> <span class="s2">"Starting app verification..."</span>

<span class="c"># Start the dev server in the background</span>
npm run dev &amp;
<span class="nv">DEV_PID</span><span class="o">=</span><span class="nv">$!</span>

<span class="c"># Wait for server to start</span>
<span class="nb">sleep </span>5

<span class="c"># Check if app is responding</span>
<span class="k">if </span>curl <span class="nt">-s</span> http://localhost:3000 <span class="o">&gt;</span> /dev/null<span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"✅ App started successfully"</span>
    
    <span class="c"># Run basic smoke tests</span>
    npm run <span class="nb">test</span>:e2e
    
    <span class="nv">RESULT</span><span class="o">=</span><span class="nv">$?</span>
<span class="k">else
    </span><span class="nb">echo</span> <span class="s2">"❌ App failed to start"</span>
    <span class="nv">RESULT</span><span class="o">=</span>1
<span class="k">fi</span>

<span class="c"># Clean up</span>
<span class="nb">kill</span> <span class="nv">$DEV_PID</span>

<span class="nb">exit</span> <span class="nv">$RESULT</span>
</code></pre></div></div>

<p><strong>For an API:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># .claude/commands/verify-api.sh</span>

<span class="nb">echo</span> <span class="s2">"Verifying API endpoints..."</span>

<span class="c"># Start API server</span>
npm run start:test &amp;
API_PID=$!
# Stop the server on exit, even if one of the checks below fails early
trap 'kill $API_PID' EXIT

<span class="nb">sleep </span>3

<span class="c"># Test key endpoints</span>
<span class="nb">echo</span> <span class="s2">"Testing /health endpoint..."</span>
curl <span class="nt">-f</span> http://localhost:8080/health <span class="o">||</span> <span class="nb">exit </span>1

<span class="nb">echo</span> <span class="s2">"Testing /api/users endpoint..."</span>
curl <span class="nt">-f</span> http://localhost:8080/api/users <span class="o">||</span> <span class="nb">exit </span>1

<span class="nb">echo</span> <span class="s2">"✅ All endpoints responding"</span>

<span class="nb">kill</span> <span class="nv">$API_PID</span>
</code></pre></div></div>

<p>Now when Claude makes changes, you can ask it to run <code class="language-plaintext highlighter-rouge">/verify-app</code> or <code class="language-plaintext highlighter-rouge">/verify-api</code> to check its work. Even better, set up a hook to run verification automatically.</p>
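<p>One way to wire that up is a <code class="language-plaintext highlighter-rouge">Stop</code> hook in <code class="language-plaintext highlighter-rouge">.claude/settings.json</code> that runs the verification script whenever Claude finishes responding. A sketch, assuming the script lives where we created it above:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "./.claude/commands/verify-app.sh"
          }
        ]
      }
    ]
  }
}
</code></pre></div></div>

<p>Claude Code surfaces the hook’s output, so a failing verification doesn’t pass silently.</p>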

<h3 id="step-5-start-running-multiple-agents">Step 5: Start Running Multiple Agents</h3>

<p>Once you’re comfortable with the above, you’re ready to run multiple Claude agents in parallel.</p>

<p><strong>Terminal setup (iTerm2 on Mac):</strong></p>

<ol>
  <li>Open iTerm2 Preferences (Cmd+,)</li>
  <li>Go to Profiles &gt; Your Profile &gt; Terminal</li>
  <li>Enable “Notifications when idle for” - set to 5 seconds</li>
  <li>Check “Send notification when current session’s…”</li>
</ol>

<p>This will give you a notification when Claude needs your input.</p>

<p><strong>Running 2-3 agents to start:</strong></p>

<p>Open 3 terminal tabs and number them:</p>
<ul>
  <li>Tab 1: Feature implementation</li>
  <li>Tab 2: Test writing</li>
  <li>Tab 3: Documentation updates</li>
</ul>

<p>In each tab, start a Claude session:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Tab 1</span>
claude

<span class="c"># In Claude Code</span>
<span class="o">&gt;</span> Implement user authentication feature

<span class="c"># Tab 2  </span>
claude

<span class="c"># In Claude Code</span>
<span class="o">&gt;</span> Write comprehensive tests <span class="k">for </span>the authentication feature

<span class="c"># Tab 3</span>
claude

<span class="c"># In Claude Code</span>
<span class="o">&gt;</span> Update README and API documentation <span class="k">for </span>authentication
</code></pre></div></div>

<p>Now you can work on all three in parallel. When Tab 1 finishes implementation, you’ll get a notification. Review it, and while Claude in Tab 2 is still writing tests, you can start Tab 1 on the next feature.</p>

<p><strong>Managing multiple branches:</strong></p>

<p>Each agent should work on its own branch (ideally in its own clone, as described earlier) to avoid conflicts:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Tab 1</span>
git checkout <span class="nt">-b</span> feature/auth
claude

<span class="c"># Tab 2</span>
git checkout <span class="nt">-b</span> feature/auth-tests  
claude

<span class="c"># Tab 3</span>
git checkout <span class="nt">-b</span> docs/auth
claude
</code></pre></div></div>

<p>This keeps work isolated until you’re ready to merge.</p>

<h3 id="step-6-integrate-with-your-tools-via-mcp">Step 6: Integrate with Your Tools via MCP</h3>

<p>MCP (Model Context Protocol) lets Claude interact with your external tools. Start with one or two integrations that would save you the most time.</p>

<p><strong>Example: Slack integration</strong></p>

<p>Create <code class="language-plaintext highlighter-rouge">.mcp.json</code> in your project root:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"mcpServers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"slack"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://slack.mcp.anthropic.com/mcp"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">"your-slack-token"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>One note: since <code class="language-plaintext highlighter-rouge">.mcp.json</code> is checked into git, keep tokens and other secrets out of it; Claude Code prompts you to authenticate with a remote server the first time you connect. With the server configured, Claude can search Slack for context:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You: "Check our Slack discussions about the authentication implementation"
Claude: [Searches Slack, finds relevant threads]
</code></pre></div></div>

<p><strong>Common MCP integrations:</strong></p>
<ul>
  <li>Slack for team communications</li>
  <li>Linear/Jira for issue tracking</li>
  <li>Sentry for error monitoring</li>
  <li>DataDog for metrics</li>
  <li>Any tool with a CLI or API</li>
</ul>

<p>Start with whichever tool you find yourself manually checking most often during development.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Boris Cherny’s workflow isn’t just about using Claude Code effectively. It’s a blueprint for the future of software development. The multi-agent approach, combined with team knowledge bases, automated verification, and thoughtful integration of AI into existing tools, creates a development environment that’s both more powerful and more maintainable than traditional workflows.</p>

<p>The irony is that this “advanced” setup isn’t actually that complicated. Most of it is configuration files checked into git, commands that run automatically, and patterns that your team already follows (now just documented for AI to understand).</p>

<p>The barrier isn’t technical complexity. It’s the shift in thinking from “AI as tool” to “AI as colleague.” Once you make that leap, the possibilities expand dramatically.</p>

<p>Start small. Pick one or two practices from this article and implement them this week. Add your CLAUDE.md file. Try Plan mode. Create a verification script. The compounding returns will surprise you.</p>

<hr />

<p><em>Have you experimented with multi-agent workflows? Share your experiences in the comments below, or reach out to me on <a href="https://linkedin.com/in/chamlian/">LinkedIn</a> to discuss your AI development strategies.</em></p>

<p><strong>My Claude Code Skills Repo to get you started:</strong></p>
<ul>
  <li><a href="https://github.com/angakh/claude-code-skills">https://github.com/angakh/claude-code-skills</a></li>
</ul>

<p><strong>Related Reading:</strong></p>
<ul>
  <li><a href="/claude-code-skills-guide.html">Creating Custom Skills in Claude Code</a></li>
  <li><a href="/coding-beyond-ai.html">Coding Beyond AI</a></li>
</ul>]]></content><author><name>Vatché</name></author><category term="AI" /><category term="Development" /><category term="Claude Code" /><category term="Claude Code" /><category term="Development &amp; DevOps" /><category term="Artificial Intelligence" /><category term="Workflow Optimization" /><summary type="html"><![CDATA[Inside Boris Cherny's production workflow—running 15+ parallel Claude agents, maintaining team-wide CLAUDE.md files, and leveraging advanced features most developers miss.]]></summary></entry><entry><title type="html">Creating Custom Skills in Claude Code: Automating Your Development Workflow</title><link href="https://vatchechamlian.com/claude-code-skills-guide.html" rel="alternate" type="text/html" title="Creating Custom Skills in Claude Code: Automating Your Development Workflow" /><published>2025-12-13T00:00:00+00:00</published><updated>2025-12-13T00:00:00+00:00</updated><id>https://vatchechamlian.com/claude-code-skills-guide</id><content type="html" xml:base="https://vatchechamlian.com/claude-code-skills-guide.html"><![CDATA[<p>If you’ve used Claude Code for any length of time, you’ve probably found yourself repeatedly prompting Claude to do the same tasks. Run tests. Format code. Commit changes. Deploy to staging. These repetitive prompts slow you down and introduce inconsistency.</p>

<p>The solution? Custom skills, which Claude Code calls “slash commands.” These are reusable scripts that Claude (and you) can invoke with a simple command like <code class="language-plaintext highlighter-rouge">/test</code> or <code class="language-plaintext highlighter-rouge">/deploy</code>. They’re essentially bash scripts with superpowers, and they’re one of the most underutilized features of Claude Code.</p>

<h2 id="what-are-skills">What Are Skills?</h2>

<p>Skills in Claude Code are executable scripts stored in your project’s <code class="language-plaintext highlighter-rouge">.claude/commands/</code> directory. They can be written in bash, Python, Node.js, or any language that can run on your system. Once created, both you and Claude can invoke them by name.</p>

<p><strong>The key difference from regular scripts:</strong></p>
<ul>
  <li>Skills are <strong>git-tracked</strong> so your whole team shares them</li>
  <li>Skills are <strong>discoverable</strong> by Claude without additional prompting</li>
  <li>Skills can <strong>pre-compute context</strong> to avoid back-and-forth with the AI</li>
  <li>Skills appear in <strong>Claude’s autocomplete</strong> when you type <code class="language-plaintext highlighter-rouge">/</code></li>
</ul>

<h2 id="your-first-skill-running-tests">Your First Skill: Running Tests</h2>

<p>Let’s start with something simple but useful. A skill that runs your test suite with the right options.</p>

<p>Create the file <code class="language-plaintext highlighter-rouge">.claude/commands/test.sh</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Runs the test suite with coverage</span>

<span class="nb">echo</span> <span class="s2">"Running test suite..."</span>

<span class="c"># Detect which test runner you're using</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-f</span> <span class="s2">"package.json"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    if </span><span class="nb">grep</span> <span class="nt">-q</span> <span class="s2">"vitest"</span> package.json<span class="p">;</span> <span class="k">then
        </span>npm run <span class="nb">test</span> <span class="nt">--</span> <span class="nt">--coverage</span>
    <span class="k">elif </span><span class="nb">grep</span> <span class="nt">-q</span> <span class="s2">"jest"</span> package.json<span class="p">;</span> <span class="k">then
        </span>npm <span class="nb">test</span> <span class="nt">--</span> <span class="nt">--coverage</span> <span class="nt">--watchAll</span><span class="o">=</span><span class="nb">false
    </span><span class="k">else
        </span>npm <span class="nb">test
    </span><span class="k">fi
elif</span> <span class="o">[</span> <span class="nt">-f</span> <span class="s2">"pytest.ini"</span> <span class="o">]</span> <span class="o">||</span> <span class="o">[</span> <span class="nt">-f</span> <span class="s2">"pyproject.toml"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span>pytest <span class="nt">--cov</span>
<span class="k">elif</span> <span class="o">[</span> <span class="nt">-f</span> <span class="s2">"go.mod"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span>go <span class="nb">test</span> ./... <span class="nt">-cover</span>
<span class="k">else
    </span><span class="nb">echo</span> <span class="s2">"❌ Could not detect test framework"</span>
    <span class="nb">exit </span>1
<span class="k">fi</span>

<span class="c"># Report results</span>
<span class="k">if</span> <span class="o">[</span> <span class="nv">$?</span> <span class="nt">-eq</span> 0 <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"✅ All tests passed!"</span>
<span class="k">else
    </span><span class="nb">echo</span> <span class="s2">"❌ Tests failed. Review output above."</span>
    <span class="nb">exit </span>1
<span class="k">fi</span>
</code></pre></div></div>

<p>Make it executable:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">chmod</span> +x .claude/commands/test.sh
</code></pre></div></div>

<p>Now you (or Claude) can simply type <code class="language-plaintext highlighter-rouge">/test</code> and it will run with the appropriate test runner and options.</p>

<h2 id="the-anatomy-of-a-good-skill">The Anatomy of a Good Skill</h2>

<p>Looking at that test skill, notice a few important patterns:</p>

<h3 id="1-clear-feedback">1. Clear Feedback</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Running test suite..."</span>
<span class="nb">echo</span> <span class="s2">"✅ All tests passed!"</span>
</code></pre></div></div>

<p>Skills should tell you what they’re doing and what happened. Claude uses this output to understand if the skill succeeded.</p>

<h3 id="2-exit-codes-matter">2. Exit Codes Matter</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="o">[</span> <span class="nv">$?</span> <span class="nt">-eq</span> 0 <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"✅ All tests passed!"</span>
<span class="k">else
    </span><span class="nb">echo</span> <span class="s2">"❌ Tests failed."</span>
    <span class="nb">exit </span>1
<span class="k">fi</span>
</code></pre></div></div>

<p>Return <code class="language-plaintext highlighter-rouge">0</code> for success, non-zero for failure. Claude uses this to know if it should continue or investigate the error.</p>

<h3 id="3-context-detection">3. Context Detection</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if </span><span class="nb">grep</span> <span class="nt">-q</span> <span class="s2">"vitest"</span> package.json<span class="p">;</span> <span class="k">then
    </span>npm run <span class="nb">test</span> <span class="nt">--</span> <span class="nt">--coverage</span>
<span class="k">elif </span><span class="nb">grep</span> <span class="nt">-q</span> <span class="s2">"jest"</span> package.json<span class="p">;</span> <span class="k">then
    </span>npm <span class="nb">test</span> <span class="nt">--</span> <span class="nt">--coverage</span> <span class="nt">--watchAll</span><span class="o">=</span><span class="nb">false</span>
</code></pre></div></div>

<p>Good skills adapt to your project automatically. This test skill works with multiple test frameworks without you needing to remember which one you’re using.</p>

<h2 id="a-more-complex-example-smart-commit">A More Complex Example: Smart Commit</h2>

<p>Let’s create a skill that makes committing code intelligent. It will:</p>
<ul>
  <li>Check for uncommitted changes</li>
  <li>Run tests before committing</li>
  <li>Generate a meaningful commit message suggestion</li>
  <li>Handle the git workflow</li>
</ul>

<p>Create <code class="language-plaintext highlighter-rouge">.claude/commands/commit.sh</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Smart commit with tests and meaningful messages</span>

<span class="nb">set</span> <span class="nt">-e</span>  <span class="c"># Exit on any error</span>

<span class="c"># Check for changes</span>
<span class="nv">MODIFIED_FILES</span><span class="o">=</span><span class="si">$(</span>git diff <span class="nt">--name-only</span><span class="si">)</span>
<span class="nv">STAGED_FILES</span><span class="o">=</span><span class="si">$(</span>git diff <span class="nt">--cached</span> <span class="nt">--name-only</span><span class="si">)</span>

<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$MODIFIED_FILES</span><span class="s2">"</span> <span class="o">]</span> <span class="o">&amp;&amp;</span> <span class="o">[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$STAGED_FILES</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"❌ No changes to commit"</span>
    <span class="nb">exit </span>1
<span class="k">fi

</span><span class="nb">echo</span> <span class="s2">"📝 Changes detected in:"</span>
<span class="nb">echo</span> <span class="s2">"</span><span class="nv">$MODIFIED_FILES</span><span class="s2">"</span>
<span class="nb">echo</span> <span class="s2">""</span>

<span class="c"># Run tests first</span>
<span class="nb">echo</span> <span class="s2">"🧪 Running tests before commit..."</span>
./.claude/commands/test.sh

<span class="k">if</span> <span class="o">[</span> <span class="nv">$?</span> <span class="nt">-ne</span> 0 <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"❌ Tests failed. Fix them before committing."</span>
    <span class="nb">exit </span>1
<span class="k">fi</span>

<span class="c"># Generate commit message suggestion</span>
<span class="nb">echo</span> <span class="s2">""</span>
<span class="nb">echo</span> <span class="s2">"Analyzing changes for commit message..."</span>

<span class="c"># Get the types of files changed</span>
<span class="nv">HAS_TESTS</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> <span class="s2">"</span><span class="nv">$MODIFIED_FILES</span><span class="s2">"</span> | <span class="nb">grep</span> <span class="nt">-c</span> <span class="s2">"test</span><span class="se">\|</span><span class="s2">spec"</span> <span class="o">||</span> <span class="nb">true</span><span class="si">)</span>
<span class="nv">HAS_DOCS</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> <span class="s2">"</span><span class="nv">$MODIFIED_FILES</span><span class="s2">"</span> | <span class="nb">grep</span> <span class="nt">-c</span> <span class="s2">"README</span><span class="se">\|\.</span><span class="s2">md"</span> <span class="o">||</span> <span class="nb">true</span><span class="si">)</span>
<span class="nv">HAS_CONFIG</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> <span class="s2">"</span><span class="nv">$MODIFIED_FILES</span><span class="s2">"</span> | <span class="nb">grep</span> <span class="nt">-c</span> <span class="s2">"config</span><span class="se">\|\.</span><span class="s2">json</span><span class="se">\|\.</span><span class="s2">yaml"</span> <span class="o">||</span> <span class="nb">true</span><span class="si">)</span>

<span class="c"># Suggest a commit message prefix</span>
<span class="k">if</span> <span class="o">[</span> <span class="nv">$HAS_TESTS</span> <span class="nt">-gt</span> 0 <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"💡 Suggested prefix: test: "</span>
<span class="k">elif</span> <span class="o">[</span> <span class="nv">$HAS_DOCS</span> <span class="nt">-gt</span> 0 <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"💡 Suggested prefix: docs: "</span>
<span class="k">elif</span> <span class="o">[</span> <span class="nv">$HAS_CONFIG</span> <span class="nt">-gt</span> 0 <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"💡 Suggested prefix: config: "</span>
<span class="k">else
    </span><span class="nb">echo</span> <span class="s2">"💡 Suggested prefix: feat: or fix: "</span>
<span class="k">fi

</span><span class="nb">echo</span> <span class="s2">""</span>
<span class="nb">read</span> <span class="nt">-p</span> <span class="s2">"Commit message: "</span> COMMIT_MSG

<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$COMMIT_MSG</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"❌ Commit message required"</span>
    <span class="nb">exit </span>1
<span class="k">fi</span>

<span class="c"># Stage and commit</span>
git add <span class="nt">-A</span>
git commit <span class="nt">-m</span> <span class="s2">"</span><span class="nv">$COMMIT_MSG</span><span class="s2">"</span>

<span class="nb">echo</span> <span class="s2">"✅ Committed: </span><span class="nv">$COMMIT_MSG</span><span class="s2">"</span>

<span class="c"># Ask about pushing</span>
<span class="nb">read</span> <span class="nt">-p</span> <span class="s2">"Push to remote? (y/n): "</span> SHOULD_PUSH
<span class="k">if</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$SHOULD_PUSH</span><span class="s2">"</span> <span class="o">=</span> <span class="s2">"y"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nv">BRANCH</span><span class="o">=</span><span class="si">$(</span>git branch <span class="nt">--show-current</span><span class="si">)</span>
    git push <span class="nt">-u</span> origin <span class="s2">"</span><span class="nv">$BRANCH</span><span class="s2">"</span>
    <span class="nb">echo</span> <span class="s2">"✅ Pushed to origin/</span><span class="nv">$BRANCH</span><span class="s2">"</span>
<span class="k">fi</span>
</code></pre></div></div>

<p>Now <code class="language-plaintext highlighter-rouge">/commit</code> becomes a smart workflow that ensures quality before committing.</p>

<h2 id="skills-that-pre-compute-context">Skills That Pre-Compute Context</h2>

<p>One of the most powerful patterns is pre-computing information that Claude would otherwise need to ask about. This eliminates back-and-forth and makes skills faster.</p>

<p><strong>Example: Status skill</strong></p>

<p>Create <code class="language-plaintext highlighter-rouge">.claude/commands/status.sh</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Shows comprehensive project status</span>

<span class="nb">echo</span> <span class="s2">"📊 Project Status Report"</span>
<span class="nb">echo</span> <span class="s2">"========================"</span>
<span class="nb">echo</span> <span class="s2">""</span>

<span class="c"># Git status</span>
<span class="nb">echo</span> <span class="s2">"🔀 Git Status:"</span>
<span class="nv">BRANCH</span><span class="o">=</span><span class="si">$(</span>git branch <span class="nt">--show-current</span><span class="si">)</span>
<span class="nv">MODIFIED</span><span class="o">=</span><span class="si">$(</span>git diff <span class="nt">--name-only</span> | <span class="nb">wc</span> <span class="nt">-l</span><span class="si">)</span>
<span class="nv">STAGED</span><span class="o">=</span><span class="si">$(</span>git diff <span class="nt">--cached</span> <span class="nt">--name-only</span> | <span class="nb">wc</span> <span class="nt">-l</span><span class="si">)</span>
<span class="nb">echo</span> <span class="s2">"  Branch: </span><span class="nv">$BRANCH</span><span class="s2">"</span>
<span class="nb">echo</span> <span class="s2">"  Modified files: </span><span class="nv">$MODIFIED</span><span class="s2">"</span>
<span class="nb">echo</span> <span class="s2">"  Staged files: </span><span class="nv">$STAGED</span><span class="s2">"</span>
<span class="nb">echo</span> <span class="s2">""</span>

<span class="c"># Dependency status</span>
<span class="nb">echo</span> <span class="s2">"📦 Dependencies:"</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-f</span> <span class="s2">"package.json"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nv">OUTDATED</span><span class="o">=</span><span class="si">$(</span>npm outdated | <span class="nb">tail</span> <span class="nt">-n</span> +2 | <span class="nb">wc</span> <span class="nt">-l</span> <span class="o">||</span> <span class="nb">echo</span> <span class="s2">"0"</span><span class="si">)</span>
    <span class="nb">echo</span> <span class="s2">"  Outdated packages: </span><span class="nv">$OUTDATED</span><span class="s2">"</span>
<span class="k">fi
</span><span class="nb">echo</span> <span class="s2">""</span>

<span class="c"># Test status</span>
<span class="nb">echo</span> <span class="s2">"🧪 Last Test Run:"</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-f</span> <span class="s2">"coverage/coverage-summary.json"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nv">COVERAGE</span><span class="o">=</span><span class="si">$(</span><span class="nb">cat </span>coverage/coverage-summary.json | <span class="nb">grep</span> <span class="nt">-o</span> <span class="s1">'"lines":{"pct":[0-9.]*'</span> | <span class="nb">grep</span> <span class="nt">-o</span> <span class="s1">'[0-9.]*$'</span><span class="si">)</span>
    <span class="nb">echo</span> <span class="s2">"  Coverage: </span><span class="nv">$COVERAGE</span><span class="s2">%"</span>
<span class="k">else
    </span><span class="nb">echo</span> <span class="s2">"  No coverage data available"</span>
<span class="k">fi
</span><span class="nb">echo</span> <span class="s2">""</span>

<span class="c"># Build status</span>
<span class="nb">echo</span> <span class="s2">"🏗️  Build Status:"</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-d</span> <span class="s2">"dist"</span> <span class="o">]</span> <span class="o">||</span> <span class="o">[</span> <span class="nt">-d</span> <span class="s2">"build"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"  ✅ Build artifacts present"</span>
<span class="k">else
    </span><span class="nb">echo</span> <span class="s2">"  ⚠️  No build artifacts found"</span>
<span class="k">fi</span>
</code></pre></div></div>

<p>Now when Claude needs to understand your project state, it can run <code class="language-plaintext highlighter-rouge">/status</code> once and get everything, rather than running multiple separate commands.</p>

<h2 id="skills-for-different-workflows">Skills for Different Workflows</h2>

<p>Here are templates for common development workflows:</p>

<h3 id="code-quality-checks">Code Quality Checks</h3>
<p><code class="language-plaintext highlighter-rouge">.claude/commands/quality.sh</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="nb">echo</span> <span class="s2">"Running code quality checks..."</span>

<span class="c"># Linting</span>
<span class="nb">echo</span> <span class="s2">"1/3 Linting..."</span>
npm run lint

<span class="c"># Type checking</span>
<span class="nb">echo</span> <span class="s2">"2/3 Type checking..."</span>
npm run typecheck

<span class="c"># Tests</span>
<span class="nb">echo</span> <span class="s2">"3/3 Testing..."</span>
./.claude/commands/test.sh

<span class="nb">echo</span> <span class="s2">"✅ All quality checks passed!"</span>
</code></pre></div></div>

<h3 id="quick-fix-workflow">Quick Fix Workflow</h3>
<p><code class="language-plaintext highlighter-rouge">.claude/commands/quick-fix.sh</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Auto-fix common issues</span>

<span class="nb">echo</span> <span class="s2">"Applying automatic fixes..."</span>

<span class="c"># Fix linting issues</span>
npm run lint <span class="nt">--</span> <span class="nt">--fix</span>

<span class="c"># Format code</span>
npm run format

<span class="c"># Update imports</span>
npm run organize-imports

<span class="nb">echo</span> <span class="s2">"✅ Auto-fixes applied. Review changes before committing."</span>
</code></pre></div></div>

<h3 id="deployment">Deployment</h3>
<p><code class="language-plaintext highlighter-rouge">.claude/commands/deploy-staging.sh</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Deploy to staging environment</span>

<span class="nb">set</span> <span class="nt">-e</span>

<span class="nb">echo</span> <span class="s2">"🚀 Deploying to staging..."</span>

<span class="c"># Run quality checks first</span>
./.claude/commands/quality.sh

<span class="c"># Build production bundle</span>
<span class="nb">echo</span> <span class="s2">"Building production bundle..."</span>
npm run build

<span class="c"># Deploy (adjust for your platform)</span>
<span class="nb">echo</span> <span class="s2">"Deploying to staging server..."</span>
<span class="c"># Examples:</span>
<span class="c"># vercel deploy --prod</span>
<span class="c"># aws s3 sync ./dist s3://staging-bucket</span>
<span class="c"># ssh staging "cd /app &amp;&amp; git pull &amp;&amp; pm2 restart app"</span>

<span class="nb">echo</span> <span class="s2">"✅ Deployed to staging!"</span>
</code></pre></div></div>

<h2 id="making-skills-discoverable">Making Skills Discoverable</h2>

<p>Claude automatically discovers skills in <code class="language-plaintext highlighter-rouge">.claude/commands/</code>, but you can help both Claude and your team by documenting them.</p>

<p>Create <code class="language-plaintext highlighter-rouge">.claude/commands/README.md</code>:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Project Skills</span>

Quick reference for available skills:

<span class="gu">## Development</span>
<span class="p">-</span> <span class="sb">`/test`</span> - Run test suite with coverage
<span class="p">-</span> <span class="sb">`/quality`</span> - Run all quality checks (lint, typecheck, test)
<span class="p">-</span> <span class="sb">`/quick-fix`</span> - Auto-fix linting and formatting issues

<span class="gu">## Git Workflow</span>
<span class="p">-</span> <span class="sb">`/commit`</span> - Smart commit with tests and message suggestions
<span class="p">-</span> <span class="sb">`/status`</span> - Show comprehensive project status

<span class="gu">## Deployment</span>
<span class="p">-</span> <span class="sb">`/deploy-staging`</span> - Deploy to staging environment
<span class="p">-</span> <span class="sb">`/deploy-prod`</span> - Deploy to production (requires confirmation)

<span class="gu">## Usage</span>
Type <span class="sb">`/`</span> in Claude Code to see available skills with autocomplete.
</code></pre></div></div>

<h2 id="advanced-skills-with-parameters">Advanced: Skills with Parameters</h2>

<p>Skills can accept parameters, though the syntax is bash-standard rather than special:</p>

<p><code class="language-plaintext highlighter-rouge">.claude/commands/test-file.sh</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Run tests for a specific file</span>
<span class="c"># Usage: /test-file path/to/test</span>

<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"Usage: /test-file &lt;path-to-test-file&gt;"</span>
    <span class="nb">exit </span>1
<span class="k">fi

</span><span class="nv">TEST_FILE</span><span class="o">=</span><span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span>

<span class="k">if</span> <span class="o">[</span> <span class="o">!</span> <span class="nt">-f</span> <span class="s2">"</span><span class="nv">$TEST_FILE</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"❌ File not found: </span><span class="nv">$TEST_FILE</span><span class="s2">"</span>
    <span class="nb">exit </span>1
<span class="k">fi

</span><span class="nb">echo</span> <span class="s2">"Running tests in </span><span class="nv">$TEST_FILE</span><span class="s2">..."</span>
npm <span class="nb">test</span> <span class="nt">--</span> <span class="s2">"</span><span class="nv">$TEST_FILE</span><span class="s2">"</span>
</code></pre></div></div>

<p>Claude can invoke this with: <code class="language-plaintext highlighter-rouge">/test-file src/utils/helper.test.ts</code></p>

<h2 id="best-practices">Best Practices</h2>

<p>After creating dozens of skills, here are the patterns that work best:</p>

<h3 id="1-start-small">1. Start Small</h3>
<p>Don’t try to automate everything at once. Start with your most repetitive task and build from there.</p>

<h3 id="2-make-them-idempotent">2. Make Them Idempotent</h3>
<p>Skills should be safe to run multiple times. If something is already done, they should recognize that and skip it.</p>
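<p>A crude but illustrative sketch of what idempotence looks like in practice (the <code class="language-plaintext highlighter-rouge">src</code>/<code class="language-plaintext highlighter-rouge">dist</code> layout is an assumption, and the freshness check is a heuristic):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
# Hypothetical idempotent build step: skip the work if it's already done

# If dist/ exists and no source file is newer than it, there's nothing to do
if [ -d "dist" ] &amp;&amp; [ -z "$(find src -newer dist -print -quit)" ]; then
    echo "✅ Build already up to date, skipping"
    exit 0
fi

npm run build
</code></pre></div></div>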

<h3 id="3-fail-fast-with-clear-messages">3. Fail Fast with Clear Messages</h3>
<p>If something is wrong, exit immediately with a clear explanation of what and why.</p>

<h3 id="4-chain-skills">4. Chain Skills</h3>
<p>Skills can call other skills. Your <code class="language-plaintext highlighter-rouge">/commit</code> skill calls <code class="language-plaintext highlighter-rouge">/test</code>, for example.</p>

<h3 id="5-version-control-them">5. Version Control Them</h3>
<p>Commit your <code class="language-plaintext highlighter-rouge">.claude/commands/</code> directory. When teammates clone the repo, they get all your skills immediately.</p>

<h2 id="when-to-create-a-skill-vs-using-a-script">When to Create a Skill vs. Using a Script</h2>

<p><strong>Create a skill when:</strong></p>
<ul>
  <li>You or Claude do it more than twice a week</li>
  <li>It has project-specific context</li>
  <li>The team should use the same approach</li>
  <li>You want it discoverable via <code class="language-plaintext highlighter-rouge">/</code> autocomplete</li>
</ul>

<p><strong>Use a regular script when:</strong></p>
<ul>
  <li>It’s one-off or rarely used</li>
  <li>It’s system-level, not project-specific</li>
  <li>It needs to be called from other scripts programmatically</li>
</ul>

<h2 id="real-world-impact">Real-World Impact</h2>

<p>After implementing a suite of skills, you’ll notice:</p>

<ol>
  <li><strong>Faster Development</strong> - No more typing out test commands or remembering flags</li>
  <li><strong>Consistent Workflows</strong> - Everyone on the team runs tests the same way</li>
  <li><strong>Better AI Collaboration</strong> - Claude can execute complex workflows autonomously</li>
  <li><strong>Documentation in Code</strong> - Your skills become living documentation of how things should be done</li>
</ol>

<p>The Boris Cherny workflow I detailed in my <a href="/orchestrating-agents-claude.html">multi-agent article</a> relies heavily on well-designed skills. His team uses them dozens of times per day, and they’re a key reason multiple Claude agents can work effectively in parallel.</p>

<h2 id="getting-started-today">Getting Started Today</h2>

<p>Here’s your action plan:</p>

<ol>
  <li><strong>Create the directory</strong>: <code class="language-plaintext highlighter-rouge">mkdir -p .claude/commands</code></li>
  <li><strong>Create your first skill</strong>: Start with <code class="language-plaintext highlighter-rouge">/test</code> using the example above</li>
  <li><strong>Make it executable</strong>: <code class="language-plaintext highlighter-rouge">chmod +x .claude/commands/test.sh</code></li>
  <li><strong>Try it</strong>: Type <code class="language-plaintext highlighter-rouge">/test</code> in Claude Code</li>
  <li><strong>Commit it</strong>: <code class="language-plaintext highlighter-rouge">git add .claude/ &amp;&amp; git commit -m "feat: add test skill"</code></li>
</ol>

<p>Within a week of using your first skill, you’ll identify three more to create. Within a month, you’ll have a suite of skills that fundamentally change how you develop.</p>

<p>The time investment is minimal (10-15 minutes per skill), but the compounding returns are substantial. Each skill you create makes both you and Claude more effective at your specific workflow.</p>

<hr />

<p><strong>My Claude Code Skills Repo to get you started:</strong>
If you want to give this a shot, I created a repo with a few skills I use daily. You can fork it and add it to your project. As always, let me know what you think.</p>
<ul>
  <li><a href="https://github.com/angakh/claude-code-skills">https://github.com/angakh/claude-code-skills</a></li>
</ul>

<p><strong>Related Reading:</strong></p>
<ul>
  <li><a href="/orchestrating-agents-claude.html">The Multi-Agent Approach: How Claude Code’s Creator Uses the Tool</a></li>
</ul>]]></content><author><name>Vatché</name></author><category term="AI" /><category term="Development" /><category term="Claude Code" /><category term="Claude Code" /><category term="Development &amp; DevOps" /><category term="Workflow Automation" /><summary type="html"><![CDATA[A practical guide to creating custom skills (slash commands) in Claude Code to automate your development workflows and make Claude more effective at your specific tasks.]]></summary></entry><entry><title type="html">The ‘95% AI Failure’ Headlines: How Nuanced Research Became Sensational News</title><link href="https://vatchechamlian.com/nada.html" rel="alternate" type="text/html" title="The ‘95% AI Failure’ Headlines: How Nuanced Research Became Sensational News" /><published>2025-08-22T00:00:00+00:00</published><updated>2025-08-22T00:00:00+00:00</updated><id>https://vatchechamlian.com/nada</id><content type="html" xml:base="https://vatchechamlian.com/nada.html"><![CDATA[<p>You’ve probably seen the headlines from Fortune, Yahoo Finance, and others: “MIT Study Shows 95% of AI Pilots Fail.” It’s been making rounds across tech media, spooking investors, and reinforcing every AI skeptic’s worldview. But when you dig into the actual research from MIT’s NANDA project, a different story emerges—one that reveals more about sensationalist journalism than AI failure rates.</p>

<p>When you’re reading something (yes, even this post), you should always question the sources. Who is saying what? What is their agenda? For example, if an oil company is releasing their research on climate impact, you should be skeptical, right?</p>

<blockquote>
  <p>“I completely trust BP Oil’s research on the impact of their 87-day crude oil spill in the Gulf of Mexico.”<br />
— No One Ever</p>
</blockquote>

<h2 id="who-or-what-is-nanda">Who or What is NANDA?</h2>

<p>NANDA is a research project at MIT focused on building decentralized AI infrastructure. NANDA stands for Networked Agents and Decentralized AI, which I believe should have the acronym NADA (I’m just being a punk). This is actually important technology that’s going to be essential for the future of agentic AI. Think of it as DNS for AI agents—NANDA is working on what they call the “NANDA Index Quilt.”</p>

<p>What is the NANDA Index Quilt?</p>

<blockquote>
  <p>“agents, resources, and tools across platforms, organizations and protocols. Through such an approach, we allow for global interoperability, discoverability, and flexible governance of agents”<br />
— p.2, Beyond DNS: Unlocking the Internet of AI Agents via the NANDA Index and Verified AgentFacts</p>
</blockquote>

<h2 id="the-methodology-raises-questions">The Methodology Raises Questions</h2>

<p>When NANDA’s actual report surfaced, some issues became apparent:</p>

<p><strong>Small Sample Size</strong>: Despite claims of analyzing “300 public implementations,” the real methodology reveals just 52 interviews and 153 survey responses. That’s not exactly a comprehensive industry survey.</p>

<p><strong>Strict Success Definition</strong>: They define “failure” as lacking “rapid revenue acceleration” or “measurable P&amp;L impact.” This excludes efficiency gains, process improvements, cost savings, and capability building—basically any outcome that isn’t immediate revenue growth. For example, I helped a Fortune 100 company with an audit issue. What normally took four people 2-3 months can now be done in under ten minutes with a customized AI agent. Those four people are now free to work on more important things, but apparently that doesn’t count as “success.”</p>

<p><strong>Selective Focus</strong>: Their own data shows a 67% success rate for purchased AI solutions and documents companies achieving “$2-10M annually” in savings. Yet somehow this becomes a “95% failure” narrative in the headlines.</p>

<h2 id="what-other-research-is-showing">What Other Research Is Showing</h2>

<p>While NANDA’s small sample painted a mixed picture, larger, more comprehensive studies tell a different story:</p>

<p><strong>Deloitte’s State of GenAI Report:</strong></p>
<ul>
  <li>74% of organizations’ most advanced AI initiatives meet or exceed ROI expectations</li>
  <li>20% report ROI exceeding 30%</li>
  <li><a href="https://www2.deloitte.com/us/en/pages/consulting/articles/state-of-generative-ai-in-enterprise.html">State of Generative AI in the Enterprise 2024</a></li>
</ul>

<p><strong>McKinsey’s Global AI Survey:</strong></p>
<ul>
  <li>71% of organizations regularly use GenAI across 1,491 participants in 101 countries</li>
  <li>Majority report cost reductions within business units</li>
  <li>Companies with systematic approaches show significantly higher success rates</li>
  <li><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">The state of AI: How organizations are rewiring to capture value</a></li>
</ul>

<p><strong>Boston Consulting Group’s Study:</strong></p>
<ul>
  <li>Companies investing over $50 million in GenAI show significantly higher success rates</li>
  <li>Success correlates with systematic upskilling and strategic implementation</li>
  <li><a href="https://www.bcg.com/publications/2024/from-potential-to-profit-with-genai">BCG AI Radar: From Potential to Profit with GenAI</a></li>
</ul>

<p>These studies, with much larger sample sizes and longer observation periods, suggest the reality is far more positive than these clickbait headlines implied.</p>

<h2 id="what-the-data-actually-shows">What the Data Actually Shows</h2>

<p>Here’s what NANDA’s research actually reveals:</p>

<ul>
  <li><strong>90% of employees</strong> regularly use AI tools for work (the “shadow AI economy”)</li>
  <li><strong>External partnerships</strong> achieve 67% success rates vs 33% for internal builds</li>
  <li><strong>Multiple documented cases</strong> of companies achieving millions in measurable savings</li>
  <li><strong>Main barriers are organizational</strong>, not technological</li>
</ul>

<p>This paints a picture of widespread AI adoption with clear best practices emerging, not the catastrophic failure narrative that made headlines.</p>

<h2 id="the-real-issue-media-sensationalism">The Real Issue: Media Sensationalism</h2>

<p>To be honest, most of what I initially thought was bias was actually something simpler: lazy journalism. NANDA’s research, while flawed, isn’t portraying a catastrophe. They’re documenting what they call the “GenAI Divide”…some organizations succeed, most struggle with implementation.</p>

<p>The catastrophe narrative came from media outlets that grabbed one statistic (95% don’t show rapid revenue acceleration) and turned it into “AI is failing everywhere.” That’s not what NANDA said, but it’s what gets clicks.</p>

<h2 id="the-bigger-picture">The Bigger Picture</h2>

<p>This bothers me for two reasons:</p>

<ol>
  <li><strong>How quickly media turned nuanced research into clickbait headlines</strong></li>
  <li><strong>How a small, obviously limited sample got extrapolated to industry-wide conclusions</strong></li>
</ol>

<p>Do you know how many companies are using AI right now? All of them. A study of 52 companies doesn’t represent that reality.</p>

<p>As someone who’s spent years advocating for democratized technology, I find that sloppy research and sensationalist reporting undermine trust in both academia and the technology industry.</p>

<h2 id="the-bottom-line">The Bottom Line</h2>

<p>Before you make any strategic decisions based on headlines about “95% failure,” consider reading the actual source. Just because Fortune regurgitated an article doesn’t mean their clickbait headline reflects reality.</p>

<p>The real story is more nuanced: AI adoption is massive, external partnerships work better than internal builds, and organizations are achieving meaningful value when they approach implementation strategically. That’s not as dramatic as “95% failure,” but it’s the truth and a lot more useful for making actual business decisions. If someone tells you “it’s just plug ‘n play”, run.</p>

<p>Do your own research. Look at the methodology. And remember that clickbait headlines are often just that: headlines, not truth.</p>

<hr />

<p><em>This analysis is based on NANDA’s own published research and publicly available information about their institutional partnerships and commercial interests.</em></p>

<p>Here are some of the other articles I read to write this post (which is why it took me a week to post this!):</p>

<ul>
  <li><a href="https://aicommission.org/2025/08/mit-report-95-of-generative-ai-pilots-at-companies-are-failing/">MIT report: 95% of generative AI pilots at companies are failing</a></li>
  <li><a href="https://fortune.com/2025/08/21/an-mit-report-that-95-of-ai-pilots-fail-spooked-investors-but-the-reason-why-those-pilots-failed-is-what-should-make-the-c-suite-anxious/">An MIT report that 95% of AI pilots fail spooked investors. But it’s the reason why those pilots failed that should make the C-suite anxious</a></li>
  <li><a href="https://nanda.media.mit.edu/">MIT’s NANDA’s Website</a></li>
  <li><a href="https://nanda.media.mit.edu/assets/pdf/nanda-whitepaper.pdf">MIT’s NANDA’s Whitepaper</a></li>
  <li><a href="https://nanda.media.mit.edu/decentralized_AI_perspective.pdf">NANDA’s paper on decentralized AI tech</a></li>
  <li><a href="https://arxiv.org/pdf/2507.14263">NANDA’s paper on the NANDA Index Quilt</a></li>
</ul>]]></content><author><name>Vatché</name></author><category term="Opinion" /><category term="AI" /><category term="Business" /><category term="Research" /><category term="Artificial Intelligence" /><category term="Business &amp; Innovation" /><category term="Media Analysis" /><category term="Research Methodology" /><category term="AI Adoption" /><summary type="html"><![CDATA[95% failure rate in gen AI pilots? For real? Not really, no.]]></summary></entry><entry><title type="html">Full Page Screenshots in Web Browsers</title><link href="https://vatchechamlian.com/fullscreen-screenshots.html" rel="alternate" type="text/html" title="Full Page Screenshots in Web Browsers" /><published>2025-07-27T00:00:00+00:00</published><updated>2025-07-27T00:00:00+00:00</updated><id>https://vatchechamlian.com/fullscreen-screenshots</id><content type="html" xml:base="https://vatchechamlian.com/fullscreen-screenshots.html"><![CDATA[<p>This is not a typical post, but I never knew about this feature and it’s super useful. Have you ever taken multiple screenshots of a long web page? Have you ever printed a web page to pdf so you could share it with someone? Have you ever downloaded an extension to take a full-page screenshot? Well aparently web browsers offer various ways to capture full-page screenshots, allowing you to save entire web pages including content that extends beyond the visible screen area. Without having to jump through hoops.</p>

<h2 id="google-chrome-worst">Google Chrome (worst)</h2>

<p>Chrome provides a hidden full-page screenshot feature within its Developer Tools:</p>

<ol>
  <li><strong>Open Developer Tools</strong>: Press <code class="language-plaintext highlighter-rouge">Ctrl + Shift + I</code> (Windows) or <code class="language-plaintext highlighter-rouge">Cmd + Option + I</code> (Mac)</li>
  <li><strong>Open Command Menu</strong>: Press <code class="language-plaintext highlighter-rouge">Ctrl + Shift + P</code> (Windows) or <code class="language-plaintext highlighter-rouge">Cmd + Shift + P</code> (Mac)</li>
  <li><strong>Run Screenshot Command</strong>: Type “screenshot” and select “Capture full size screenshot”</li>
</ol>

<p>The screenshot will be automatically saved to your default Downloads folder.</p>
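
<p>If you ever need to script this, Chrome can also take screenshots from the terminal in headless mode. This is a rough sketch, not an official recipe: the binary name varies by OS (<code class="language-plaintext highlighter-rouge">google-chrome</code>, <code class="language-plaintext highlighter-rouge">chromium</code>, or the full macOS app path), and since <code class="language-plaintext highlighter-rouge">--screenshot</code> captures the window, you approximate “full page” by setting a tall window size:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Capture a tall window of the page without opening DevTools
google-chrome --headless --screenshot=page.png --window-size=1280,5000 "https://example.com"
</code></pre></div></div>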

<p><img src="./assets/img/posts/20250727/chrome-screenshot.png" alt="Chrome Full Page Screenshot" /></p>

<h2 id="mozilla-firefox-method-two-is-super-easy">Mozilla Firefox (method two is super easy)</h2>

<p>Firefox offers multiple methods for taking full-page screenshots:</p>

<h3 id="method-1-developer-tools">Method 1: Developer Tools</h3>
<ol>
  <li><strong>Open Developer Tools</strong>: Right-click on the page and select “Inspect Element” or press <code class="language-plaintext highlighter-rouge">Ctrl + Shift + I</code> (Windows) or <code class="language-plaintext highlighter-rouge">Cmd + Option + I</code> (Mac)</li>
  <li><strong>Access Settings</strong>: Click the “…” (three-dot) menu in the Developer Tools toolbar</li>
  <li><strong>Enable Screenshot Tool</strong>: Select “Settings” and check “Take a screenshot of the entire page” under “Available Toolbox Buttons”</li>
  <li><strong>Take Screenshot</strong>: Click the camera icon that appears in the toolbar</li>
</ol>

<h3 id="method-2-built-in-screenshot-tool">Method 2: Built-in Screenshot Tool</h3>
<ol>
  <li>Right-click anywhere on the webpage</li>
  <li>Select “Take Screenshot”</li>
  <li>Click “Save full page” in the screenshot interface</li>
  <li>Click “Download” to save the image</li>
</ol>

<p><img src="./assets/img/posts/20250727/firefox-screenshot.png" alt="Firefox Full Page Screenshot" />
<img src="./assets/img/posts/20250727/firefox-screenshot-tool.png" alt="Firefox Screenshot Tool" /></p>

<h2 id="microsoft-edge-easiest">Microsoft Edge (easiest)</h2>

<p>Edge includes a dedicated “Web Capture” feature for screenshots:</p>

<ol>
  <li><strong>Open Web Capture</strong>: Click the menu (…) button and select “Web capture” or press <code class="language-plaintext highlighter-rouge">Ctrl + Shift + S</code></li>
  <li><strong>Capture Full Page</strong>: Select “Capture full page” from the options</li>
  <li><strong>Save or Edit</strong>: Choose to save, copy, or annotate the screenshot before saving</li>
</ol>

<p><img src="./assets/img/posts/20250727/edge-screenshot.png" alt="Edge Screenshot Tool" /></p>

<h2 id="safari">Safari</h2>

<p>Safari on Mac provides screenshot functionality through its developer tools:</p>

<ol>
  <li><strong>Enable Developer Menu</strong>: Go to Safari &gt; Settings &gt; Advanced and check “Show features for web developers”</li>
  <li><strong>Open Web Inspector</strong>: From the Develop menu, select “Show Web Inspector”</li>
  <li><strong>Capture Screenshot</strong>: In the Elements tab, right-click on the <code class="language-plaintext highlighter-rouge">&lt;html&gt;</code> element and select “Capture Screenshot”</li>
</ol>]]></content><author><name>Vatché</name></author><category term="LifeHacks" /><summary type="html"><![CDATA[Did you know you can take full-page screenshots in your browser without the need of an extension?]]></summary></entry><entry><title type="html">Battle of the IDEs*!</title><link href="https://vatchechamlian.com/battle-of-the-ides.html" rel="alternate" type="text/html" title="Battle of the IDEs*!" /><published>2025-07-15T00:00:00+00:00</published><updated>2025-07-15T00:00:00+00:00</updated><id>https://vatchechamlian.com/battle-of-the-ides</id><content type="html" xml:base="https://vatchechamlian.com/battle-of-the-ides.html"><![CDATA[<p>When ChatGPT first came out, one of the first things I did with it was ask it to write me code. This was two years ago, and a lot has changed. First it was extensions in VS Code, then it was good extensions in VS Code, then it was GitHub’s Copilot, Cursor, and now Kiro. We are going to cover my impressions of these tools in this article, but before we do, I’d like to suggest something for new and junior developers. These tools can accelerate what you do, but understanding what they produce, why it is good or not so good, and whether the code is secure or highly exploitable is very important. This comes with experience, but that should not stop you from using these tools. Instead, I would recommend that when you are working on something, you ask the assistant to explain it to you. Or try writing it yourself first, then see what changes it makes and ask it why it made those changes. Stay curious.</p>

<p>The tools that I have tested are as follows:</p>
<ul>
  <li>VS Code with Augment</li>
  <li>VS Code with CoPilot</li>
  <li>Cursor</li>
  <li>Kiro</li>
  <li>Claude Code (I know, its not an IDE)</li>
</ul>

<h2 id="vs-code-with-augment-or-copilot">VS Code with Augment or CoPilot</h2>

<p>If you are used to working in VS Code, this is one way you can stay within your comfort zone and try the different AI tools that are at your disposal. Personally, I felt that in this matchup Augment surpassed CoPilot purely based on its context. The interface with Augment is fairly easy: it provides you with explanations using natural language for prompting, and inline code changes that look a lot like cherry-picking commits (“Do you want to accept this change?” vs. it just writing everything for you). As it goes through your code and starts to make “changes” (you have to accept the changes for them to be applied), you get to see what the changes are. At the end of the vibe coding session, or “pair coding” as Dharmesh calls it, it will provide you with an update on what it did, why it did it, and how you can verify or test the changes. In addition, it gives you the option to apply all the changes it has made, or you can open each file and accept the changes one by one.</p>

<p>CoPilot does very similar things, but it had a lot of issues with the problems I was trying to solve; the context was not there. It was making changes in what seemed like a vacuum.</p>

<h3 id="the-extension-philosophy">The Extension Philosophy</h3>

<p>This represents the “evolution” approach to AI integration—taking what works and gradually adding AI capabilities. VS Code’s strength lies in its massive ecosystem, with over 30,000 extensions and millions of users, offering proven stability and extensive customization options. However, these AI features can feel bolted on rather than native, and you’re limited by legacy architecture decisions that weren’t designed with AI-first workflows in mind. They are still pretty cool though, so if you are on the fence, grab your favorite drink, load up your calming Spotify playlist, and explore.</p>

<h3 id="note">NOTE:</h3>

<p>Both Augment and CoPilot require network bandwidth for cloud-based AI processing, which can introduce latency that affects typing responsiveness. For enterprise development, consider that your code is being sent to external servers, which may raise privacy and security concerns depending on your organization’s requirements.</p>

<h2 id="cursor">Cursor</h2>

<p>The sad thing about VS Code’s implementation is that Cursor feels like what VS Code should have been. You can import your settings from VS Code, and now you have what behaves like VS Code but with the AI already built in. Cursor’s context capability and coding are extremely well rounded, but like many of the LLMs that try to code, it sometimes falls short; issues with dependencies and even syntax errors are common. I am sure that as time goes on all of these LLMs will get better, but depending on what you are working on, these can be a real hindrance and counterproductive. Remember, the whole point of using AI in coding is to help you move quickly, but if you are spending more time troubleshooting code that you didn’t write and that you don’t understand, what are you really doing?</p>

<h3 id="example-of-a-cursor-issue">Example of a Cursor Issue</h3>

<p>I had a JSON template that I was using to display company data for a demo. I wanted to create a few different companies and use that same JSON template. I asked Cursor to take the data I had and generate a JSON file in the same manner as the one that already existed. This should not be a difficult task; if anything, this is exactly the type of work you would ask AI to do. Where did it go wrong? Everywhere! The JSON was complete garbage! Not the content, but the structure and syntax were totally broken. This is something that it should have easily picked up on. But it didn’t; it totally missed it.</p>

<h3 id="the-ai-first-philosophy">The AI First Philosophy</h3>

<p>Cursor represents the “revolution” approach, designed from the ground up with AI collaboration in mind. It’s not trying to retrofit AI into an existing editor; it’s built specifically for the AI era. This means:</p>

<ul>
  <li><strong>Cohesive AI integration</strong> throughout the entire development experience</li>
  <li><strong>Modern architecture</strong> optimized for AI features from day one</li>
  <li><strong>Multi-model support</strong> letting you choose between different AI providers</li>
  <li><strong>Seamless AI chat integration</strong> without leaving your coding context</li>
</ul>

<h3 id="the-modern-developer-experience">The Modern Developer Experience</h3>

<p>Cursor feels like what an IDE should be in 2024, with a clean, modern interface that doesn’t feel cluttered with legacy features. The <strong>visual diff interface</strong> shows exactly what the AI wants to change before you accept it, and the <strong>intelligent code editing</strong> can modify existing code based on natural language instructions. However, as noted, the smaller ecosystem and potential for AI-generated errors mean you need to stay vigilant about code quality. New files and code have to be reviewed thoroughly before committing to a repo; you don’t want to be that guy.</p>

<h2 id="kiro">Kiro</h2>

<p>Kiro is very similar to Cursor in the sense that you install it, import from VS Code (if you want), and it is ready to go. It has a cool splash page when you launch the IDE where you can Vibe or Spec. Vibe is for when you have a particular task you want to accomplish or you want to prototype something. Spec is when you use natural language to plan out what you are trying to do, ideate, and then build. This is similar to how Replit and Lovable approach coding. It is a robust IDE that feels lightweight, but its ability to understand your codebase is not 100% there. For example, I asked it to review the code for ieps.ai and provide me with some insight into what my application does. It broke everything down to the best of its ability, but it totally missed an entire layer of complexity. I am using S3 buckets for storage and it assumed that everything was stored locally. It also missed Lambda function calls and conversational agents.</p>

<p>This really surprised me because I am a huge fan of Claude Code, and Kiro and Claude Code are built on the same underlying LLM!</p>

<p>But Kiro has something that I have not really seen in the other IDEs: Agent Hooks, Agent Steering, and MCP Servers are available out of the box. You still need to connect them, but the capability is right there. These additional features allow for customization on a level that I have not seen elsewhere yet.</p>

<h3 id="amazons-entry-into-ai-powered-development">Amazon’s Entry into AI Powered Development</h3>

<p>As Amazon’s entry into the AI IDE space, Kiro brings some interesting innovations but also reveals the challenges of building truly context-aware development environments. The <strong>Vibe and Spec modes</strong> represent a thoughtful approach to different development workflows. For example, in Vibe mode you can quickly prototype or work on a specific task, while in Spec mode you can plan and set up some guidelines prior to starting development. In previous posts regarding vibe coding I have covered how important this is.</p>

<h3 id="context-limitations-and-learning-curve">Context Limitations and Learning Curve</h3>

<p>This really highlights a common challenge across AI-powered IDEs. The fact that it missed critical infrastructure components like S3 buckets and Lambda functions means it has a hard time deciphering modern distributed architectures. This emphasizes the importance of maintaining your understanding of what the AI is doing rather than blindly trusting its analysis. If a junior developer asked Kiro what this app was doing, they might try to implement something that is already there, screwing up more of the code and then trying to figure out what went wrong.</p>

<h2 id="claude-code-not-an-ide">Claude Code (not an IDE)</h2>

<p>In my opinion, Claude Code is the closest to having someone you trust working next to you. The interface is not ideal for many, since it’s a CLI (command line interface), but for someone who grew up on vim, it feels like home. Claude’s reasoning, context, and ability to code surpass all of the above options. When it starts to tackle a problem it provides you with an approach and asks if you agree with it before it starts. It shows you how many tokens it is using and provides you with a summary of what it just did. It is awesome! BUT! There are two drawbacks that I hope get resolved soon. First, those working on Windows will need to enable virtualization and install Ubuntu (via the Microsoft Store, not that bad but still an extra step). Second, it doesn’t have a chat history. As you reach the limit of its context window you will get a context percentage at the bottom of the prompt window, and as it starts to count down you will begin to sweat.</p>

<p>Over a decade ago I committed this code to GitHub: <a href="https://github.com/angakh/screenrc">screenrc</a>. I was working for a presidential campaign in Boston, and the internet we had at the campaign headquarters was line of sight to the Prudential building. This meant that crappy weather would disrupt your connection. If you were ssh’d into a remote box running something and got disconnected, your work was lost. So I used “screen” and it was helpful. I am sharing this because if you are one of the unfortunate ones that only has access to a Windows machine, you may want to run screen in Ubuntu and have a few “tabs” open as you work. You can have one for Claude Code, one for git commits, etc.</p>
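
<p>If you want to give screen a try, the basics are only a few commands (these are the default keybindings; a custom .screenrc may remap them):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Start a named session for Claude Code
screen -S claude

# Inside the session: Ctrl-a c opens a new "tab", Ctrl-a n cycles through them
# Detach with Ctrl-a d; everything keeps running

# Reattach later, even after a dropped SSH connection
screen -r claude
</code></pre></div></div>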

<p>If you are on a Mac? It is flawless. You can drag and drop images to help guide the front-end development of an app. I really hope, for the sake of the Windows users, that they offer a better solution for integrating with Claude Code, because as of this writing it still feels like the best of the bunch.</p>

<h3 id="the-conversational-approach">The Conversational Approach</h3>

<p>Claude Code represents a fundamentally different philosophy—a <strong>conversational interface</strong> that feels like pair programming with an expert rather than just autocomplete with a dash of ADHD and Red Bull. The <strong>deep reasoning capabilities</strong> mean it can explain not just what code does, but why, and the <strong>safety-focused design</strong> helps avoid common security pitfalls.</p>

<h3 id="terminal-native-benefits-and-challenges">Terminal Native Benefits and Challenges</h3>

<p>The <strong>terminal native approach</strong> makes it lightweight and fast, easily integrated into existing workflows, and perfect for remote development. However, the lack of chat history and context window limitations can be frustrating during longer coding sessions. The Windows setup requirements (virtualization and Ubuntu) add friction for some developers, but for those comfortable with command-line environments, it offers an unparalleled collaborative coding experience.</p>

<h3 id="the-trust-factor">The Trust Factor</h3>

<p>What sets Claude Code apart is its transparency—showing token usage, asking for approval before major changes, and providing clear summaries of actions taken. For me this built a level of trust that I just don’t have with the other IDE’s.</p>

<h2 id="the-real-battle-philosophy-vs-features">The Real Battle: Philosophy vs. Features</h2>

<p>This isn’t just about which tool has the most features…it’s about fundamentally different philosophies of how AI should integrate into development workflows, and from what I’ve seen from hands on experience, each approach has real world implications.</p>

<h3 id="the-extension-approach-vs-code">The Extension Approach (VS Code)</h3>

<p>VS Code represents the “evolution” philosophy: take what works and gradually add AI capabilities. This approach offers:</p>

<p><strong>Pros:</strong></p>
<ul>
  <li>Familiar interface for existing users</li>
  <li>Massive ecosystem and community</li>
  <li>Proven stability and reliability</li>
  <li>Freedom to mix and match AI providers</li>
</ul>

<p><strong>Cons:</strong></p>
<ul>
  <li>AI features can feel bolted-on (as experienced with CoPilot’s context issues)</li>
  <li>Potential for feature conflicts</li>
  <li>Limited by legacy architecture decisions</li>
  <li>Inconsistent AI integration across different extensions</li>
</ul>

<h3 id="the-ai-native-approach-claude-code-cursor-kiro">The AI-Native Approach (Claude Code, Cursor, Kiro)</h3>

<p>These tools represent the “revolution” philosophy: design for AI first workflows from day one.</p>

<p><strong>Pros:</strong></p>
<ul>
  <li>Cohesive AI integration (Cursor’s seamless experience)</li>
  <li>Modern architecture optimized for AI features</li>
  <li>Consistent user experience</li>
  <li>Purpose built for AI collaboration (Claude Code’s conversational approach)</li>
</ul>

<p><strong>Cons:</strong></p>
<ul>
  <li>Smaller ecosystems</li>
  <li>Less proven in production environments</li>
  <li>Potential vendor lock-in</li>
  <li>Learning curve for existing developers</li>
  <li>Context understanding limitations (as seen with Kiro missing infrastructure components)</li>
</ul>

<h2 id="my-take-what-actually-matters">My Take: What Actually Matters</h2>

<p>After testing all these tools extensively, here’s what I’ve learned:</p>

<p><strong>Context is King</strong>: The tools that understand your entire project (like Augment) consistently outperform those that work in isolation. When CoPilot was making changes “in a vacuum,” it became clear that AI without proper context is just fancy autocomplete.</p>

<p><strong>Trust Through Transparency</strong>: Claude Code’s approach of explaining its reasoning and asking for approval creates a collaborative relationship rather than a “just trust me” dynamic. You learn while you code.</p>

<p><strong>The Debugging Tax</strong>: Remember, if you’re spending more time troubleshooting AI generated code that you don’t understand, you’re not actually moving faster. This is especially important for junior developers who might be tempted to accept everything the AI suggests.</p>

<p><strong>Interface Matters Less Than You Think</strong>: While Cursor’s modern UI is appealing, Claude Code’s CLI approach proves that good AI collaboration transcends interface preferences. What matters is the quality of the AI’s reasoning and its ability to work with your specific codebase.</p>

<h3 id="for-individual-developers">For Individual Developers</h3>

<p><strong>Just an idea</strong> - All of the IDEs mentioned in this post have free tiers. Trying them out has never been easier, and I would strongly suggest you create branches of an existing repo and tackle the same problem in each IDE, with each branch named after the IDE. Use the same prompt and see what you get.</p>

<p>If you’re comfortable with VS Code and want to start gradually, <strong>Augment</strong> provides the best context aware experience within a familiar environment.</p>

<p>For those ready to embrace AI first development, <strong>Cursor</strong> offers the most polished experience, but be prepared to verify and understand the code it generates.</p>

<p>If you’re comfortable with command line interfaces and prioritize AI reasoning quality over flashy UIs, <strong>Claude Code</strong> provides the most trustworthy collaborative experience.</p>

<h3 id="for-teams-and-learning">For Teams and Learning</h3>

<p><strong>Claude Code</strong> is excellent for learning because it explains its reasoning, making it ideal for junior developers who want to understand not just what code does, but why.</p>

<p><strong>Cursor</strong> works well for teams that need a modern, collaborative environment with good visual feedback on changes.</p>

<p>Avoid tools with poor context understanding (like Kiro missing key infrastructure) for complex projects—the time spent correcting misunderstandings negates the productivity benefits.</p>

<h2 id="the-bottom-line">The Bottom Line</h2>

<p>The “winner” depends entirely on your workflow, experience level, and what you value most. For me it is hands down Claude Code and it’s not even close. But regardless of which tool you choose, remember to <strong>stay curious</strong>, ask the AI to explain its decisions, and never stop understanding what your code actually does.</p>

<p>The real victory isn’t finding the perfect AI assistant…it’s learning to collaborate effectively with AI while maintaining your skills and understanding as a developer.</p>]]></content><author><name>Vatché</name></author><category term="AI" /><category term="Development" /><category term="Artificial Intelligence" /><category term="Opinion" /><summary type="html"><![CDATA[A hands-on review of AI-powered IDEs and coding assistants, from incumbents like VS Code to the new Kiro IDE from Amazon.]]></summary></entry><entry><title type="html">Vibe Coding Platforms: The Promise vs. Reality of AI-Powered App Development</title><link href="https://vatchechamlian.com/vibe-coding-reviews.html" rel="alternate" type="text/html" title="Vibe Coding Platforms: The Promise vs. Reality of AI-Powered App Development" /><published>2025-05-29T00:00:00+00:00</published><updated>2025-05-29T00:00:00+00:00</updated><id>https://vatchechamlian.com/vibe-coding-reviews</id><content type="html" xml:base="https://vatchechamlian.com/vibe-coding-reviews.html"><![CDATA[<p>One of the biggest barriers to developing applications has always been coding. If you have a brilliant idea but lack programming skills, you’d typically need to hire a developer or learn to code yourself. Enter “vibe coding” platforms—AI-powered tools that promise to build applications through natural conversation. But do they live up to the hype?</p>

<p>I’ve spent months testing these platforms, investing real money to access full feature sets across multiple services. From invite-only beta platforms to established players, I tested Envato, Build.ai, Builder.ai, Replit, Lovable, Tempo, Emergent, and several others. Here’s what I discovered about the current state of conversational app development.</p>

<h2 id="the-testing-ground-a-real-world-project">The Testing Ground: A Real-World Project</h2>

<p>For consistency, I asked each platform to build the same application: an event management tool for my wife’s special events role at a local private school. The requirements included OAuth authentication, user profiles, event creation and management, team member invitations, and budget tracking—a reasonably complex application that would test each platform’s capabilities.</p>

<h2 id="what-they-all-get-right-the-magic-of-first-impressions">What They All Get Right: The Magic of First Impressions</h2>

<p>The initial results were genuinely impressive. Every platform I tested could take my description and generate a mostly functional application from a single prompt. Within minutes, I had working prototypes with:</p>

<ul>
  <li>OAuth authentication systems</li>
  <li>User profiles and management</li>
  <li>Event creation and editing interfaces</li>
  <li>Team invitation functionality</li>
  <li>Budget tracking components</li>
  <li>Basic responsive design</li>
</ul>

<p>This first iteration capability is transformative. For rapid prototyping or proof-of-concept development, these tools are unmatched. The speed from idea to working demo is remarkable and represents a genuine breakthrough in application development accessibility.</p>

<h2 id="the-pricing-puzzle-different-models-different-pain-points">The Pricing Puzzle: Different Models, Different Pain Points</h2>

<p>The platforms take notably different approaches to monetization:</p>

<h3 id="credit-based-systems-emergent">Credit-Based Systems (Emergent)</h3>
<ul>
  <li>Pay-per-use model with real-time credit consumption</li>
  <li>Deployment costs ~50 credits ($20 USD)</li>
  <li>Costs escalate quickly with iterations</li>
  <li>Transparent but expensive for extensive development</li>
</ul>

<h3 id="subscription--microtransactions-replit">Subscription + Microtransactions (Replit)</h3>
<ul>
  <li>$25/month base subscription</li>
  <li>Agent checkpoints: $0.25 each</li>
  <li>Assistant checkpoints: $0.05 each</li>
  <li>Free deployments</li>
  <li>Occasionally waives fees for AI-caused errors</li>
</ul>

<h3 id="the-hidden-truth-builderais-revelation">The Hidden Truth: Builder.ai’s Revelation</h3>
<p>One particularly eye-opening discovery was Builder.ai, which marketed itself as an AI coding platform but actually employed human developers working behind the scenes. This “smoke and mirrors” approach highlights the importance of understanding what’s actually powering these platforms.</p>

<h2 id="the-critical-flaw-where-ai-development-breaks-down">The Critical Flaw: Where AI Development Breaks Down</h2>

<p>Here’s where every platform I tested failed: <strong>iteration and feature addition</strong>. The moment you try to modify or extend the initial application, the AI systems struggle with code organization and context management. I encountered numerous examples of this breakdown:</p>

<h3 id="case-study-1-the-svg-disaster">Case Study 1: The SVG Disaster</h3>
<p>When requesting an update to an SVG code snippet, one platform generated malformed code with a closing tag <code class="language-plaintext highlighter-rouge">&lt;/svg&gt;vg&gt;</code>, causing compilation errors that required additional credits to resolve.</p>

<h3 id="case-study-2-the-button-color-catastrophe">Case Study 2: The Button Color Catastrophe</h3>
<p>I requested simple form validation that would change a button’s color to green when all fields were completed. The AI successfully implemented this feature, but somehow broke:</p>
<ul>
  <li>Login functionality</li>
  <li>File upload capabilities</li>
  <li>User account creation</li>
  <li>API endpoint connections</li>
  <li>Nearly every other system component</li>
</ul>

<p>The button turned green perfectly, but the application became unusable.</p>

<h2 id="the-solution-hybrid-development-approach">The Solution: Hybrid Development Approach</h2>

<p>The most effective strategy I discovered combines these platforms’ strengths with traditional development tools:</p>

<ol>
  <li><strong>Use vibe coding for rapid prototyping</strong></li>
  <li><strong>Export to GitHub</strong> (most platforms offer this)</li>
  <li><strong>Continue development locally</strong> with traditional AI coding assistants</li>
  <li><strong>Leverage better context windows</strong> in tools like Claude, Copilot, or Cursor</li>
</ol>

<p>This approach gives you the speed of initial AI generation with the control and reliability of established development workflows.</p>
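
<p>In practice the hand-off is just a normal git workflow. A minimal sketch, with a placeholder URL standing in for whatever repo the platform exported:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># After exporting the prototype to GitHub from the vibe coding platform
git clone https://github.com/your-user/your-prototype.git
cd your-prototype

# Continue iterating locally with your preferred AI coding assistant
git checkout -b feature/local-iteration
</code></pre></div></div>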

<h2 id="platform-comparison">Platform Comparison</h2>

<table>
  <thead>
    <tr>
      <th>Platform</th>
      <th>Pricing Model</th>
      <th>GitHub Integration</th>
      <th>Best For</th>
      <th>Major Limitations</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Replit</td>
      <td>$25/mo + checkpoints</td>
      <td>✅</td>
      <td>Full development cycle</td>
      <td>Complex feature additions</td>
    </tr>
    <tr>
      <td>Emergent</td>
      <td>Credit-based (~$20/deploy)</td>
      <td>✅</td>
      <td>One-off prototypes</td>
      <td>Cost escalation</td>
    </tr>
    <tr>
      <td>Tempo</td>
      <td>Subscription</td>
      <td>✅</td>
      <td>Rapid prototyping</td>
      <td>Limited customization</td>
    </tr>
    <tr>
      <td>Lovable</td>
      <td>Subscription</td>
      <td>✅</td>
      <td>UI-focused apps</td>
      <td>Limited backend complexity</td>
    </tr>
  </tbody>
</table>

<h2 id="replit-the-developers-choice">Replit: The Developer’s Choice</h2>

<p>Among all platforms tested, Replit emerged as the most developer-friendly option. It provides:</p>

<ul>
  <li><strong>Integrated development environment</strong> with full IDE capabilities</li>
  <li><strong>Built-in deployment pipeline</strong> with autoscaling</li>
  <li><strong>Complete GitHub integration</strong> and management</li>
  <li><strong>Object storage and database</strong> creation tools</li>
  <li><strong>Comprehensive logging and console</strong> access</li>
  <li><strong>Cost bypass mechanism</strong> through GitHub sync (changes pulled from GitHub don’t count as checkpoints)</li>
</ul>

<p>However, Replit has limitations with complex operations like PDF processing, where external services (Lambda functions with S3 storage) become necessary.</p>

<h2 id="pro-tips-for-ai-assisted-coding">Pro Tips for AI-Assisted Coding</h2>

<p>Through extensive testing, I’ve identified several strategies that dramatically improve results:</p>

<h3 id="1-code-organization-strategy">1. Code Organization Strategy</h3>
<p>AI models tend to cram everything into single files. Use this prompt to improve maintainability:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Please review [filename].tsx and break up the functions into separate files. 
I'd like to organize the code into these categories: services, handlers, 
endpoints, and middleware. Each category should be in its own file.
</code></pre></div></div>

<h3 id="2-planning-before-coding">2. Planning Before Coding</h3>
<p>Always establish approach before implementation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Do not write any code yet. First, provide me with your approach to 
[describe your goal] and ask me if I agree with it. We should ensure 
we're in agreement before writing any code.
</code></pre></div></div>

<h3 id="3-controlled-implementation">3. Controlled Implementation</h3>
<p>Once you’ve agreed on the approach:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I agree with this approach. Please update only one file at a time and 
ask if I have questions about the changes. If I don't have questions, 
you can write [Option A: the complete file] or [Option B: only the 
changed code with clear start/end markers].
</code></pre></div></div>

<p><strong>Option A</strong> works better for non-integrated tools.<br />
<strong>Option B</strong> enables longer conversations by conserving the context window.</p>

<h2 id="ide-integration-augment-leads-the-pack">IDE Integration: Augment Leads the Pack</h2>

<p>For traditional IDE-based development, I recommend <strong>Augment</strong>. It excels at:</p>
<ul>
  <li>Rapid code indexing</li>
  <li>Contextual code suggestions</li>
  <li>Natural language code queries</li>
  <li>Seamless VS Code integration</li>
</ul>

<h2 id="the-verdict-promise-partially-delivered">The Verdict: Promise Partially Delivered</h2>

<p>Vibe coding platforms represent a genuine breakthrough in application development accessibility, but they’re not the complete solution they promise to be. They excel at:</p>

<ul>
  <li><strong>Rapid prototyping</strong></li>
  <li><strong>Initial application generation</strong></li>
  <li><strong>Lowering barriers to entry</strong></li>
  <li><strong>Proof-of-concept development</strong></li>
</ul>

<p>However, they struggle with:</p>
<ul>
  <li><strong>Iterative development</strong></li>
  <li><strong>Complex feature additions</strong></li>
  <li><strong>Code maintainability</strong></li>
  <li><strong>Context management</strong></li>
</ul>

<h2 id="looking-forward">Looking Forward</h2>

<p>The future likely belongs to hybrid approaches that combine the rapid generation capabilities of vibe coding platforms with the precision and control of traditional development tools. As these platforms mature and improve their iteration capabilities, they may eventually deliver on their full promise.</p>

<p>For now, treat them as powerful prototyping tools that can jumpstart your development process, but be prepared to transition to traditional development methods for serious application building.</p>

<p>The democratization of app development is happening, just not quite as seamlessly as the marketing suggests. The key is understanding these tools’ strengths and limitations, then using them strategically within a broader development workflow. Like I mentioned in my previous post about <a href="https://www.thecodewhisperer.com/are-coding-skills-following-the-typists-path">Are Coding Skills Following the Typists Path</a>, the future belongs to those who can effectively prompt, architect, direct, and integrate with AI tools.</p>

<hr />

<p><em>Have you experimented with vibe coding platforms? Share your experiences and insights in the comments below.</em>
</p>]]></content><author><name>Vatché</name></author><category term="AI" /><category term="Development" /><category term="No-Code" /><category term="Artificial Intelligence" /><category term="Development &amp; DevOps" /><summary type="html"><![CDATA[A hands-on review of AI-powered vibe coding platforms—testing the promise of building apps through conversation and revealing where they excel and fail.]]></summary></entry><entry><title type="html">From ‘Works on My Machine’ to ‘Works for Everyone’</title><link href="https://vatchechamlian.com/from-works-on-my-machine-to-works-for-everyone.html" rel="alternate" type="text/html" title="From ‘Works on My Machine’ to ‘Works for Everyone’" /><published>2025-05-08T00:00:00+00:00</published><updated>2025-05-08T00:00:00+00:00</updated><id>https://vatchechamlian.com/from-works-on-my-machine-to-works-for-everyone</id><content type="html" xml:base="https://vatchechamlian.com/from-works-on-my-machine-to-works-for-everyone.html"><![CDATA[<p>A long time ago, when I was working in the Drupal CMS space, I was introduced to Lando. It was one of the first times I had seen a Docker container impact the workflow of a project. It was not easy to set up initially, but in the end the result was so positive that it could not be ignored. Development environments have evolved significantly since then to solve the “works on my machine” problem.</p>

<p>In this post we will be getting technical, so if that is not your thing, don’t feel bad about hitting the back button.</p>

<tweet>Remember when "works on my machine" was a valid excuse? Docker containers ended that era and we're all better for it.</tweet>

<h2 id="1-docker-and-dev-containers">1. Docker and Dev Containers</h2>

<h3 id="what-are-dev-containers-and-how-do-they-work">What are Dev Containers and how do they work?</h3>

<p>Dev Containers are development environments containerized using Docker that allow developers to use a consistent, pre-configured environment. They encapsulate dependencies, runtimes, and tools needed for development.</p>

<p>Dev Containers work by leveraging Docker’s containerization technology but with a focus on development rather than deployment. When a developer opens a project with Dev Container support (in VS Code or other compatible IDEs), the IDE builds and runs the container, then connects to it for development tasks like editing, debugging, and running code.</p>

<h3 id="how-do-devcontainerdockerfile-and-devcontainerdevcontainerjson-work-together">How do “.devcontainer/Dockerfile” and “.devcontainer/devcontainer.json” work together?</h3>

<p>These two files form the foundation of a Dev Container:</p>

<p><strong>“.devcontainer/Dockerfile”</strong>: Defines the base container image and steps to install required tools and dependencies.</p>

<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> python:3.11</span>
<span class="k">RUN </span>apt-get update <span class="o">&amp;&amp;</span> apt-get <span class="nb">install</span> <span class="nt">-y</span> <span class="se">\
</span>    git <span class="se">\
</span>    curl <span class="se">\
</span>    <span class="o">&amp;&amp;</span> <span class="nb">rm</span> <span class="nt">-rf</span> /var/lib/apt/lists/<span class="k">*</span>
<span class="k">RUN </span>pip <span class="nb">install </span>poetry
</code></pre></div></div>

<p><strong>“.devcontainer/devcontainer.json”</strong>: Configures how the Dev Container integrates with the IDE and environment.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Python Project"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"build"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"dockerfile"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Dockerfile"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"context"</span><span class="p">:</span><span class="w"> </span><span class="s2">".."</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"customizations"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"vscode"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"extensions"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"ms-python.python"</span><span class="p">,</span><span class="w"> </span><span class="s2">"ms-python.vscode-pylance"</span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"forwardPorts"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="mi">8000</span><span class="p">],</span><span class="w">
  </span><span class="nl">"postCreateCommand"</span><span class="p">:</span><span class="w"> </span><span class="s2">"poetry install"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The Dockerfile builds the container, while the devcontainer.json file configures how the IDE interacts with it, including IDE extensions to install, ports to forward, and commands to run after container creation.</p>
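
<p>You don’t even need an IDE to bring one of these up. Here is a minimal sketch using the open source Dev Containers CLI, assuming Node.js is installed and the project contains the two files above:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Install the reference CLI for the Dev Containers spec
npm install -g @devcontainers/cli

# Build and start the container defined in .devcontainer/
devcontainer up --workspace-folder .

# Run a command inside it, e.g. the project's tests
devcontainer exec --workspace-folder . poetry run pytest
</code></pre></div></div>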

<h3 id="benefits-for-team-collaboration">Benefits for team collaboration</h3>

<p><strong>Consistency</strong>: Every team member works in the exact same environment, eliminating “works on my machine” problems</p>

<p><strong>Onboarding</strong>: New developers can be productive within minutes by simply opening the project in their IDE</p>

<p><strong>Isolation</strong>: Projects with different dependencies don’t conflict with each other</p>

<p><strong>Version control</strong>: The development environment itself is versioned alongside the code</p>

<h3 id="how-they-help-achieve-parity-with-production">How they help achieve parity with production</h3>

<p>Dev Containers can use the same base images as production containers; shared dependencies ensure development behavior matches production; environment variables can be configured similarly to production; and service dependencies (databases, message queues) can be included via Docker Compose.</p>
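
<p>One quick way to sanity-check parity is to build the dev and production images side by side and compare their runtimes. This sketch assumes a production <code class="language-plaintext highlighter-rouge">Dockerfile</code> at the repo root; the image tags are placeholders:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical: dev and prod Dockerfiles share the same base image
docker build -f Dockerfile -t myapp:prod .
docker build -f .devcontainer/Dockerfile -t myapp:dev .

# The interpreter versions should match across both images
docker run --rm myapp:prod python --version
docker run --rm myapp:dev python --version
</code></pre></div></div>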

<tweet>Dev containers don't just solve "works on my machine"—they solve "works exactly like production" too.</tweet>

<h2 id="2-cloud-development-environments">2. Cloud Development Environments</h2>

<h3 id="what-are-cloud-ides-like-googles-project-idx-or-github-codespaces">What are cloud IDEs like Google’s Project IDX or GitHub Codespaces?</h3>

<p>Cloud Development Environments provide fully functional development environments hosted in the cloud and accessible through web browsers or local IDEs. They eliminate the need to set up local development environments completely.</p>

<p><strong>GitHub Codespaces</strong>: Pre-configured cloud environments integrated with GitHub repositories</p>

<p><strong>Google’s Project IDX</strong>: Google’s cloud development platform designed for web and mobile app development</p>

<p><strong>GitPod</strong>: Open source cloud development environments that can integrate with GitHub, GitLab, and Bitbucket</p>

<h3 id="how-do-they-differ-from-local-dev-containers">How do they differ from local Dev Containers?</h3>

<p><strong>Resource allocation</strong>: Cloud environments use cloud resources instead of local computer power</p>

<p><strong>Access</strong>: Accessible from any device with a web browser</p>

<p><strong>Setup time</strong>: Instant access without local Docker installation or configuration</p>

<p><strong>Cost model</strong>: Usually involves usage-based pricing rather than local hardware costs</p>

<p><strong>Performance</strong>: Network latency can affect the development experience</p>

<h3 id="configuration-files-they-use">Configuration files they use</h3>

<p><strong>GitHub Codespaces</strong>: Uses the same “.devcontainer” configuration as local Dev Containers</p>

<p><strong>Project IDX</strong>: Uses “.idx/dev.nix” configuration files based on the Nix package manager</p>

<p>Example “.idx/dev.nix” for Project IDX:</p>

<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span> <span class="nv">pkgs</span><span class="p">,</span> <span class="o">...</span> <span class="p">}:</span> <span class="p">{</span>
  <span class="nv">channel</span> <span class="o">=</span> <span class="s2">"stable"</span><span class="p">;</span>
  
  <span class="nv">packages</span> <span class="o">=</span> <span class="p">[</span>
    <span class="nv">pkgs</span><span class="o">.</span><span class="nv">nodejs_20</span>
    <span class="nv">pkgs</span><span class="o">.</span><span class="nv">yarn</span>
    <span class="nv">pkgs</span><span class="o">.</span><span class="nv">python311</span>
  <span class="p">];</span>
  
  <span class="nv">idx</span><span class="o">.</span><span class="nv">extensions</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s2">"dbaeumer.vscode-eslint"</span>
    <span class="s2">"esbenp.prettier-vscode"</span>
  <span class="p">];</span>
  
  <span class="nv">idx</span><span class="o">.</span><span class="nv">previews</span> <span class="o">=</span> <span class="p">{</span>
    <span class="nv">enable</span> <span class="o">=</span> <span class="kc">true</span><span class="p">;</span>
    <span class="nv">previews</span> <span class="o">=</span> <span class="p">[</span>
      <span class="p">{</span>
        <span class="nv">command</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"npm"</span> <span class="s2">"run"</span> <span class="s2">"dev"</span><span class="p">];</span>
        <span class="nv">manager</span> <span class="o">=</span> <span class="s2">"web"</span><span class="p">;</span>
        <span class="nv">id</span> <span class="o">=</span> <span class="s2">"web"</span><span class="p">;</span>
      <span class="p">}</span>
    <span class="p">];</span>
  <span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="advantages-and-limitations">Advantages and limitations</h3>

<p><strong>Advantages:</strong></p>
<ul>
  <li>Work from anywhere with internet access</li>
  <li>No local setup required</li>
  <li>Consistent environment for all team members</li>
  <li>Easily scalable resources for intensive tasks</li>
  <li>Collaboration features like real-time pair programming</li>
</ul>

<p><strong>Limitations:</strong></p>
<ul>
  <li>Requires internet connectivity</li>
  <li>Potential latency issues</li>
  <li>Monthly costs for team usage</li>
  <li>Less control over the underlying infrastructure</li>
  <li>Privacy/security concerns with proprietary code in cloud environments</li>
</ul>

<h2 id="3-other-approaches">3. Other Approaches</h2>

<h3 id="how-do-tools-like-docker-compose-fit-into-development-workflows">How do tools like Docker Compose fit into development workflows?</h3>

<p>Docker Compose allows developers to define and run multi-container Docker applications. It’s often used alongside Dev Containers to set up supporting services needed for development (databases, caches, message queues), create a network of interconnected services that mirror production, and manage environment variables and volumes across multiple containers.</p>

<p>Example “docker-compose.yml”:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">version</span><span class="pi">:</span> <span class="s1">'</span><span class="s">3'</span>
<span class="na">services</span><span class="pi">:</span>
  <span class="na">app</span><span class="pi">:</span>
    <span class="na">build</span><span class="pi">:</span> <span class="s">.</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">8000:8000"</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">.:/app</span>
    <span class="na">depends_on</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">db</span>
      <span class="pi">-</span> <span class="s">redis</span>
  
  <span class="na">db</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">postgres:14</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="na">POSTGRES_PASSWORD</span><span class="pi">:</span> <span class="s">devpassword</span>
      <span class="na">POSTGRES_USER</span><span class="pi">:</span> <span class="s">devuser</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">pgdata:/var/lib/postgresql/data</span>
  
  <span class="na">redis</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">redis:7</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">6379:6379"</span>
<span class="na">volumes</span><span class="pi">:</span>
  <span class="na">pgdata</span><span class="pi">:</span>
</code></pre></div></div>

<h3 id="differences-between-dev-environments-and-docker-compose">Differences between Dev Environments and Docker Compose</h3>

<p><strong>Dev Containers</strong> focus on the development environment itself (IDE integration, extensions, tools)</p>

<p><strong>Docker Compose</strong> orchestrates multiple services that work together</p>

<p>Dev Containers can integrate with Docker Compose to provide both aspects</p>
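<p>As a sketch of that integration (service and path names are illustrative), a “devcontainer.json” can point at an existing Compose file and treat one of its services as the development container:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "name": "app-dev",

  // Reuse the Compose file; "app" is the service the editor attaches to
  "dockerComposeFile": "docker-compose.yml",
  "service": "app",
  "workspaceFolder": "/app",

  // Stop the whole Compose stack when the dev container shuts down
  "shutdownAction": "stopCompose"
}
</code></pre></div></div>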

<h3 id="role-of-package-managers-like-uv-and-task-runners-like-just">Role of package managers like “uv” and task runners like “just”</h3>

<p><strong>Modern package managers</strong> like “uv” (for Python, written in Rust) improve dependency management speed and reliability. I highly recommend “uv”; it is so much faster than a traditional pip workflow.</p>
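<p>A typical “uv” workflow looks something like this (project and package names are placeholders):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Scaffold a project with a pyproject.toml
uv init myproject
cd myproject

# Add a dependency; uv resolves it, updates the lockfile,
# and installs it into a managed virtual environment
uv add requests

# Reproduce the exact locked environment on another machine
uv sync

# Run a command inside the project environment
uv run -- pytest
</code></pre></div></div>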

<p><strong>Task runners</strong> like “just” provide a consistent interface for common development tasks</p>

<p>Example “justfile”:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>default:
    @just --list

# Run project unit tests
test:
    uv run -- pytest

# Run MLflow server
mlflow:
    uv run -- mlflow server --host 127.0.0.1 --port 5000

# Serve latest registered model locally
serve:
    uv run -- mlflow models serve -m models:/mymodel/latest -h 0.0.0.0 -p 8080
</code></pre></div></div>

<p>These tools help standardize common development tasks across the team, regardless of the environment they’re working in.</p>


<h2 id="4-best-practices">4. Best Practices</h2>

<h3 id="when-to-choose-each-approach">When to choose each approach</h3>

<p><strong>Dev Containers</strong>: For teams with complex development environments who want IDE integration</p>

<p><strong>Cloud Development</strong>: For distributed teams, or when onboarding needs to be extremely fast</p>

<p><strong>Docker Compose</strong>: For applications with multiple interconnected services</p>

<p><strong>Package managers/task runners</strong>: As complementary tools in any environment</p>

<h3 id="ensuring-development-matches-production">Ensuring development matches production</h3>

<p>To keep development aligned with production:</p>
<ul>
  <li>Use the same base images and version tags when possible (sketched below)</li>
  <li>Document all dependencies explicitly</li>
  <li>Use infrastructure-as-code to define both environments</li>
  <li>Test in a staging environment that mirrors production before deployment</li>
  <li>Include all critical services in the development environment</li>
</ul>
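<p>The first point is the cheapest win: pin the development image to the exact tag production runs, not a floating “latest”. A minimal sketch, assuming a Node.js service (the specific tag is illustrative):</p>

<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Same base image and tag as the production Dockerfile
FROM node:20.11-bookworm-slim

WORKDIR /app

# Install exactly what the lockfile specifies
COPY package.json package-lock.json ./
RUN npm ci
</code></pre></div></div>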

<h3 id="trade-offs-between-simplicity-and-completeness">Trade-offs between simplicity and completeness</h3>

<p><strong>Simple environments</strong> are faster to set up but may miss edge cases</p>

<p><strong>Complete environments</strong> catch more issues but require more resources and maintenance</p>

<p>Start with the minimal viable environment and incrementally add complexity as needed</p>

<p>Focus on matching the aspects of production that affect development most directly</p>

<h3 id="managing-environment-variables">Managing environment variables</h3>

<p>A few ground rules keep environment variables manageable:</p>
<ul>
  <li>Use “.env” files for development-specific variables</li>
  <li>Never commit production secrets to version control</li>
  <li>Consider tools like “direnv” to manage environment switching (a one-line example follows the snippet below)</li>
  <li>Use secret management services for production environments</li>
  <li>Define default values in the codebase with clear documentation</li>
</ul>

<p>Example approach with “.env.example” and “.gitignore”:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># .env.example (committed to version control)</span>
<span class="nv">DATABASE_URL</span><span class="o">=</span>postgresql://devuser:devpassword@db:5432/devdb
<span class="nv">REDIS_URL</span><span class="o">=</span>redis://redis:6379/0
<span class="nv">API_KEY</span><span class="o">=</span>example_key_for_development

<span class="c"># .gitignore</span>
.env
</code></pre></div></div>
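<p>On the “direnv” point above, a one-line “.envrc” is enough to load the “.env” file automatically whenever you enter the project directory (run “direnv allow” once to approve it):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># .envrc -- direnv's stdlib "dotenv" function loads ./.env into the shell
dotenv
</code></pre></div></div>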

<h2 id="real-world-scenario-full-stack-web-application">Real-world scenario: Full-stack web application</h2>

<p>For a typical full-stack web application with a React frontend, Node.js API, and PostgreSQL database:</p>

<p><strong>Dev Container approach:</strong></p>
<ul>
  <li>“.devcontainer/Dockerfile” with Node.js, PostgreSQL client tools</li>
  <li>“.devcontainer/devcontainer.json” with VS Code extensions for React, Node</li>
  <li>“docker-compose.yml” for PostgreSQL service</li>
</ul>

<p><strong>Cloud IDE approach:</strong></p>
<ul>
  <li>GitHub Codespaces configuration with the same Dev Container setup</li>
  <li>Environment variables set through the Codespaces secrets</li>
</ul>

<p><strong>Local only approach:</strong></p>
<ul>
  <li>“docker-compose.yml” with services for frontend, backend, and database</li>
  <li>Volume mounts for live code reloading</li>
</ul>

<p><strong>Hybrid approach:</strong></p>
<ul>
  <li>Dev Container for the development environment</li>
  <li>Docker Compose for service dependencies</li>
  <li>Task runner (“just” or npm scripts) for common commands</li>
  <li>Environment managed through “.env” files with “.env.example” templates</li>
</ul>

<p>The best solution depends on your team’s specific needs, but containerized environments, whether local or cloud-based, have become a well-proven way to ensure consistency and reduce onboarding friction.</p>

<p>The evolution from “works on my machine” to “works for everyone” represents more than just a technical advancement—it’s a fundamental shift in how we think about development environments. We’ve moved from treating environment setup as a necessary evil to embracing it as a core part of our development workflow.</p>

<p>Whether you choose local dev containers, cloud development environments, or a hybrid approach, the key is consistency and reproducibility. The days of spending hours debugging environment-specific issues are largely behind us, replaced by systems that ensure every developer on your team can be productive from day one.</p>

<p>The infrastructure choices you make today will determine how smoothly your team scales tomorrow. Choose tools that grow with your team and make the complex simple, not the simple complex.</p>

<p>I hope you found this post helpful, thanks for reading.</p>]]></content><author><name>Vatché</name></author><category term="opinion" /><category term="Development &amp; DevOps" /><summary type="html"><![CDATA[Development environments have evolved from 'works on my machine' nightmares to consistent, shareable containers. Here's your guide to Docker dev containers, cloud IDEs, and best practices.]]></summary></entry></feed>