Benchmarking Devin AI
Last night I offered Scott Wu and Elad Gil the opportunity to benchmark Cognition's Devin AI against the test I built.
While data is a specialized branch of engineering, it is broad enough that a capable engineering AI should be competent in it.
To make sure there are no unexpected human-in-the-loop scenarios involved, I set a strict deadline for the benchmark results (today, 6pm Pacific). This is an extremely lenient timeline when you consider that executing our benchmark should take less than 5 minutes.
By now the app has been tested by LangChain engineer Alex Kira and reviewed by Joe Reis (a man who needs no introduction in data circles).
Benchmarking App
Of course, this is not just for Cognition. Anyone with access to the Devin software can go to my Streamlit app, download the test questions, run them against the model, and upload the answers back to the app.
This means that even if Cognition were to manually proofread and edit the answers, someone else could re-upload freshly generated answers from Devin the next day, and everyone would see the difference on the leaderboard.
In the meantime, if anyone needs to reach me about any of this:
Email: Simply Hit Reply
LinkedIn: linkedin.com/in/smir