Faking a JPEG

379 points | by @todsacerdoti | July 11th, 2025 at 10:57pm

@tomsmeding

July 12th, 2025 at 7:47am

They do have a robots.txt [1] that disallows robot access to the spigot tree (as expected), but removing the /spigot/ part from the URL seems to still lead to Spigot. [2] The /~auj namespace is not disallowed in robots.txt, so even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.

[1]: https://www.ty-penguin.org.uk/robots.txt

[2]: https://www.ty-penguin.org.uk concatenated with /~auj/cheese (don't want to create links there)
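For illustration, the sort of entry that would also cover that namespace (hypothetical, not taken from the site's actual robots.txt):

    User-agent: *
    Disallow: /~auj/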

@jandrese

July 12th, 2025 at 5:15am

I wonder if you could mess with AI input scrapers by adding fake captions to each image? I imagine something like:

    (big green blob)

    "My cat playing with his new catnip ball"

    (blue mess of an image)

    "Robins nesting"

@marcod

July 12th, 2025 at 1:57am

Reading about Spigot made me remember https://www.projecthoneypot.org/

I was very excited 20 years ago, every time I got an email from them saying that the scripts and donated MX records on my website had helped catch a harvester:

> Regardless of how the rest of your day goes, here's something to be happy about -- today one of your donated MXs helped to identify a previously unknown email harvester (IP: 172.180.164.102). The harvester was caught by a spam trap email address created with your donated MX:

@Szpadel

July 12th, 2025 at 10:29am

The worst offender I saw is Meta.

They have the facebookexternalhit bot (which sometimes uses the default Python requests user agent) that, as they document, explicitly ignores robots.txt.

It's supposedly used to check whether links contain malware. But if someone actually wanted to serve malware, the first thing they'd do is serve an innocent page to Facebook's AS and their user agent.

They also re-check every URL every month to validate that it still doesn't contain malware.

The issue is this: some bad actors spam Facebook with URLs to expensive endpoints (like a search with random filters), and Facebook effectively provides your competition with a free DDoS service. They flood you with > 10 req/s for days, every month.

@mrbluecoat

July 11th, 2025 at 11:52pm

> I felt sorry for its thankless quest and started thinking about how I could please it.

A refreshing (and amusing) attitude versus getting angry and venting on forums about aggressive crawlers.

@EspadaV9

July 12th, 2025 at 12:11am

I like this one

https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...

Some kind of statement piece

@kazinator

July 12th, 2025 at 8:57am

Faking a JPEG is not only less CPU-intensive than making one properly, but by doing so you are fuzzing whatever malware is on the other end; if it is decoding the JPEG and isn't robust, it may well crash.

@derefr

July 12th, 2025 at 1:21am

> It seems quite likely that this is being done via a botnet - illegally abusing thousands of people's devices. Sigh.

Just because traffic is coming from thousands of devices on residential IPs doesn't mean it's a botnet in the classical sense. It could just as well be people signing up for a "free VPN service" — or a tool that "generates passive income" for them — where the actual cost of running the software is that you become an exit node for both other "free VPN service" users' traffic and the traffic of users of the VPN's sibling commercial brand. (E.g. scrapers like this one.)

This scheme is known as "proxyware" — see https://www.trendmicro.com/en_ca/research/23/b/hijacking-you...

@superjan

July 12th, 2025 at 6:12am

There is a particular pattern (a block/tag marker) that is illegal in the compressed JPEG stream. If I recall correctly, you should insert a 0x00 after a 0xFF byte in the output to avoid it. If there is interest I can follow up later (not today).
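For reference, a minimal sketch of that byte-stuffing rule (the helper name is made up): any 0xFF produced in the entropy-coded data gets a stuffed 0x00 after it, so a decoder doesn't mistake it for a marker.

    def stuff_jpeg_bytes(entropy_data: bytes) -> bytes:
        # JPEG byte stuffing: a bare 0xFF inside the entropy-coded stream
        # would look like a marker (e.g. 0xFF 0xD9 = end of image), so the
        # encoder appends a 0x00 after every 0xFF it emits.
        out = bytearray()
        for b in entropy_data:
            out.append(b)
            if b == 0xFF:
                out.append(0x00)
        return bytes(out)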

@mhuffman

July 12th, 2025 at 5:16pm

I don't understand the reasoning behind the "feed them a bunch of trash" option. If you can identify them (for example, because they ignore the robots.txt file), it seems you could just keep them hung up on network connections or similar, without paying to serve infinite garbage for crawlers to ingest.

@bschwindHN

July 11th, 2025 at 11:51pm

You should generate fake but believable EXIF data to go along with your JPEGs too.
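A rough sketch of what that could look like (assuming the piexif library; the camera models, timestamp range, and helper name are invented for illustration):

    import random
    import piexif  # third-party library for building EXIF blobs

    def fake_exif() -> bytes:
        # Pick a plausible camera and timestamp; none of this relates
        # to the actual (fake) image content.
        make, model = random.choice([
            (b"Canon", b"Canon EOS 80D"),
            (b"NIKON CORPORATION", b"NIKON D7500"),
            (b"SONY", b"ILCE-7M3"),
        ])
        stamp = "2023:%02d:%02d %02d:%02d:%02d" % (
            random.randint(1, 12), random.randint(1, 28),
            random.randint(0, 23), random.randint(0, 59), random.randint(0, 59))
        exif_dict = {
            "0th": {piexif.ImageIFD.Make: make, piexif.ImageIFD.Model: model},
            "Exif": {piexif.ExifIFD.DateTimeOriginal: stamp.encode()},
        }
        return piexif.dump(exif_dict)  # bytes for the APP1 (EXIF) segment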

@112233

July 12th, 2025 at 5:42am

So how do I set up an instance of this beautiful flytrap? Do I need a valid personal blog, or can I plop something on cloudflare to spin on their edge?

@Modified3019

July 12th, 2025 at 2:22am

Love the effort.

That said, these seem to be heavily biased towards displaying green, so one “sanity” check on the scraper's side would be: if your bot is suddenly pulling in thousands of green images, something might be up.

@thayne

July 12th, 2025 at 5:33pm

I'm curious how the author identifies the crawlers that use random User-Agents and distinct IP addresses per request. Is there some other indicator that can be used to identify them?

On a different note, if the goal is to waste the bot's resources, one potential improvement could be to use, as templates, very large images with repeating structure that compress extremely well as JPEGs, so that they take much more RAM and CPU to decode than the relatively little CPU, RAM, and bandwidth needed to generate and transfer them.
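A rough illustration of that asymmetry (assuming Pillow, generated once offline as a template rather than per request; sizes are approximate):

    from PIL import Image

    # A single-colour 16384x16384 image is roughly 800 MB of RGB pixels
    # once decoded, but because every 8x8 block is identical it compresses
    # to only a few MB as a JPEG. Generate it once and serve it many times.
    template = Image.new("RGB", (16384, 16384), (34, 139, 34))
    template.save("template.jpg", quality=75)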

@time0ut

July 12th, 2025 at 2:56pm

JPEG is fascinating and quite complex. Here is a really excellent high level explanation of how it works:

https://www.youtube.com/watch?v=0me3guauqOU

@lblume

July 11th, 2025 at 11:43pm

Given that current LLMs do not consistently output total garbage, and can be used as judges in a fairly efficient way, I highly doubt this could even in theory have any impact on the capabilities of future models. Once (a) models are capable enough to distinguish between semi-plausible garbage and possibly relevant text and (b) companies are aware of the problem, I do not think data poisoning will be an issue at all.

@jeroenhd

July 12th, 2025 at 1:10pm

This makes me wonder if there are more efficient image formats that one might want to feed botnets. JPEG is highly complex, but PNG uses a relatively simple DEFLATE stream as well as some basic filters. Perhaps one could make a zip-bomb-like PNG that only consists of a few bytes?
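A sketch of that idea, hand-rolling a minimal greyscale PNG so the DEFLATE stream stays tiny; the sizes in the comments are approximate:

    import struct, zlib

    def chunk(tag: bytes, data: bytes) -> bytes:
        # PNG chunk layout: length, type, data, CRC over type + data
        return (struct.pack(">I", len(data)) + tag + data
                + struct.pack(">I", zlib.crc32(tag + data)))

    def png_bomb(width: int = 20000, height: int = 20000) -> bytes:
        # 8-bit greyscale, all-zero scanlines: ~400 MB of raw pixel data
        # that DEFLATE squeezes into a few hundred KB. Small on the wire,
        # expensive for whoever decodes it.
        ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
        co = zlib.compressobj(9)
        row = b"\x00" * (width + 1)  # filter byte + zero pixels per scanline
        idat = b"".join(co.compress(row) for _ in range(height)) + co.flush()
        return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr)
                + chunk(b"IDAT", idat) + chunk(b"IEND", b""))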

@sim7c00

July 12th, 2025 at 5:15pm

love how you speak about pleasing bots and them getting excited :D fun read, fun project. thanks!

@larcanio

July 12th, 2025 at 3:11pm

Happy to realize real heroes do exist.

@a-biad

July 12th, 2025 at 12:37pm

I am a bit confused about the context. What exactly is the point of serving fake data to web crawlers?

@puttycat

July 12th, 2025 at 2:01am

> compression tends to increase the entropy of a bit stream.

Does it? Encryption increases entropy, but not sure about compression.
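One quick way to check empirically (byte-level Shannon entropy before and after zlib compression; the figures in the comments are approximate):

    import math, zlib
    from collections import Counter

    def byte_entropy(data: bytes) -> float:
        # Shannon entropy in bits per byte
        counts = Counter(data)
        return -sum(c / len(data) * math.log2(c / len(data))
                    for c in counts.values())

    text = b" ".join(str(i).encode() for i in range(20000))
    print(byte_entropy(text))                 # ~3.3 bits/byte (digits and spaces)
    print(byte_entropy(zlib.compress(text)))  # ~7.9 bits/byte (looks nearly random)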

@BubbleRings

July 12th, 2025 at 10:58am

Is there a reason you couldn’t generate your images by grabbing random rectangles of pixels from one source image and pasting them into random locations in another source image? Then you would have a fully valid JPEG that no AI could easily identify as generated junk. I guess that would require much more CPU than your current method, huh?
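A sketch of that approach, assuming Pillow and two same-sized source images (the function name is invented); the catch, as noted, is that the result has to be re-encoded as a JPEG on every request:

    import random
    from PIL import Image

    def collage(src_a: Image.Image, src_b: Image.Image) -> Image.Image:
        # Copy a random rectangle from src_a into a random spot in src_b.
        # Assumes src_b is at least as large as src_a.
        out = src_b.copy()
        w = random.randint(32, src_a.width // 2)
        h = random.randint(32, src_a.height // 2)
        x = random.randint(0, src_a.width - w)
        y = random.randint(0, src_a.height - h)
        patch = src_a.crop((x, y, x + w, y + h))
        out.paste(patch, (random.randint(0, out.width - w),
                          random.randint(0, out.height - h)))
        return out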

@dheera

July 12th, 2025 at 12:08am

> So the compressed data in a JPEG will look random, right?

I don't think JPEG data is compressed enough to be indistinguishable from random.

SD VAE with some bits lopped off gets you better compression than JPEG and yet the latents don't "look" random at all.

So you might think Huffman encoded JPEG coefficients "look" random when visualized as an image but that's only because they're not intended to be visualized that way.

@bvan

July 12th, 2025 at 4:52pm

Love this.

@ardme

July 12th, 2025 at 4:41pm

Old man yells at cloud, then creates a labyrinth of mirrors for the images of the clouds to reflect back on each other.

@jekwoooooe

July 12th, 2025 at 11:05am

It’s our moral imperative to make crawling cost prohibitive and also poison LLM training.

@hashishen

July 12th, 2025 at 1:03am

the hero we needed and deserved

@Domainzsite

July 12th, 2025 at 2:11pm

This is pure internet mischief at its finest. Weaponizing fake JPEGs with valid structure and random payloads to burn botnet cycles? Brilliant. Love the tradeoff thinking: maximize crawler cost, minimize CPU. The Huffman bitmask tweak is chef’s kiss. Spigot feels like a spiritual successor to robots.txt flipping you off in binary.