68 points by @grace77 | July 12th, 2025 at 3:07pm
July 12th, 2025 at 7:28pm
This is a surprisingly good idea. The model-vs-model comparison is fun, but not really that useful on its own.
But this could be a legitimate way to design apps in general if you could tell the models what you liked and didn't like.
July 12th, 2025 at 4:46pm
I tried the vote and both results always suck; there's no option to say that neither is a winner. Also, it seems from the network tab that you're sending 4 (or 5?) requests but only displaying the first two that respond, which biases the comparison toward the small models that respond more quickly and usually results in showing two bad results.
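Something like the following is what I mean. Everything here (the model list, the generate() call) is invented, since I can only guess at the actual implementation, but it shows why "first two to respond" skews the matchup and how pre-selecting the pair avoids it:

    type ModelId = string;

    // Placeholder for whatever API call the site actually makes.
    async function generate(model: ModelId, prompt: string): Promise<string> {
      return `<html><!-- output from ${model} --></html>`;
    }

    // Biased: fire a request per model and keep whichever two finish first.
    async function firstTwoToRespond(models: ModelId[], prompt: string) {
      const winners: { model: ModelId; html: string }[] = [];
      await Promise.all(
        models.map(async (model) => {
          const html = await generate(model, prompt);
          if (winners.length < 2) winners.push({ model, html }); // fastest (small) models dominate
        })
      );
      return winners;
    }

    // Unbiased: sample two distinct models up front, then wait for both responses.
    function pickTwo<T>(items: T[]): [T, T] {
      const i = Math.floor(Math.random() * items.length);
      let j = Math.floor(Math.random() * (items.length - 1));
      if (j >= i) j += 1; // ensure the two indices differ
      return [items[i], items[j]];
    }

    async function randomPair(models: ModelId[], prompt: string) {
      const pair = pickTwo(models);
      return Promise.all(
        pair.map(async (model) => ({ model, html: await generate(model, prompt) }))
      );
    }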
July 13th, 2025 at 1:14am
It would lend credibility if you published your system prompt.
July 12th, 2025 at 11:45pm
Interesting idea; this benchmark maps fairly closely to the types of output I typically ask LLMs to generate for me day-to-day.
July 12th, 2025 at 6:02pm
Nice! Training models using reward signals for code correctness is obviously very common; I'm very curious to see how good things can get using a reward signal obtained from visual feedback.
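Purely speculative, but a pipeline like this is what I have in mind: render the generated page headlessly and let a vision model judge the screenshot, with that score used as the reward. Puppeteer is just one way to render; scoreScreenshot() is made up and stands in for whatever vision model does the judging:

    import puppeteer from "puppeteer";

    // Hypothetical scorer: any vision model that returns a 0-1 rating of the screenshot.
    declare function scoreScreenshot(png: Uint8Array, prompt: string): Promise<number>;

    // Render the generated HTML and score how well it matches the design prompt;
    // the resulting number could serve as the reward during training.
    async function visualReward(generatedHtml: string, prompt: string): Promise<number> {
      const browser = await puppeteer.launch();
      try {
        const page = await browser.newPage();
        await page.setViewport({ width: 1280, height: 800 });
        await page.setContent(generatedHtml, { waitUntil: "networkidle0" });
        const png = await page.screenshot({ type: "png" });
        return scoreScreenshot(png, prompt);
      } finally {
        await browser.close();
      }
    }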
July 12th, 2025 at 10:28pm
Very cool! Can the generated code and designs be used?
@coryvirok
July 12th, 2025 at 4:28pm
This is really good! It would be cool to somehow get human designs into the mix to see how the models compare. I bet there are curated design datasets with descriptions that you could pass to each of the models, then run voting on a "bonus" question (comparing the human-made and AI-generated versions) after the normal genAI voting round.
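Rough sketch of the bonus round I'm imagining; all of the type names and fields here are invented for illustration:

    // A curated entry pairs a description (the prompt sent to each model)
    // with a human-made reference design.
    interface CuratedDesign {
      description: string;
      humanHtml: string;
    }

    interface BonusRound {
      prompt: string;
      candidateA: { source: "human" | "model"; html: string };
      candidateB: { source: "human" | "model"; html: string };
    }

    // After the normal model-vs-model vote, pair the human design against one of
    // the AI outputs as an extra, unlabeled comparison; randomize the position so
    // voters can't tell which side is which.
    function makeBonusRound(entry: CuratedDesign, modelHtml: string): BonusRound {
      const human = { source: "human" as const, html: entry.humanHtml };
      const model = { source: "model" as const, html: modelHtml };
      const swap = Math.random() < 0.5;
      return {
        prompt: entry.description,
        candidateA: swap ? human : model,
        candidateB: swap ? model : human,
      };
    }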