this post was submitted on 24 Jul 2024

Technology

DuckDuckGo, Bing, Mojeek, and other search engines are not returning full Reddit results any more.

[–] MCasq_qsaCJ_234@lemmy.zip 0 points 4 months ago (1 children)

I have two questions. How much of your bandwidth do those bots consume? And by blocking search crawlers, do you disappear from search results entirely, or are you still listed but with the content itself hidden?

I ask because I don't know much about this topic as it applies to running a website or a fediverse instance.

[–] ptz@dubvee.org 0 points 4 months ago* (last edited 4 months ago) (1 children)

How much of your bandwidth do those bots consume?

Pretty negligible per bot per request, but I'm not here to feed them. They also travel in packs, so the bandwidth multiplies, and it costs me money when I exceed my monthly bandwidth quota. I've blocked them for so long that I no longer have the data to tally an aggregate total (I only keep 90 days of logs). SemrushBot alone, before I blocked it, averaged about 15 GB a month, though that one is fairly aggressive. ImagesiftBot, which pulls down any images it can find, would also use quite a bit, I imagine, if it were allowed.
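The post doesn't show how a per-bot figure like SemrushBot's ~15 GB/month gets tallied. A minimal sketch, assuming combined-format access logs (a common nginx/Apache default; the log format and function name here are assumptions, not his actual tooling):

```python
import re
from collections import defaultdict

# Matches the status, bytes-sent, referrer, and user-agent fields at the
# tail of a combined-format access log line.
LOG_RE = re.compile(r'" \d{3} (?P<bytes>\d+) "[^"]*" "(?P<agent>[^"]*)"')

def bandwidth_by_agent(lines):
    """Sum response bytes per user-agent string across log lines."""
    totals = defaultdict(int)
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            totals[m.group("agent")] += int(m.group("bytes"))
    return dict(totals)
```

Feeding 90 days of logs through something like this gives a rough per-crawler total; real crawlers vary their user-agent strings, so grouping by a substring (e.g. "SemrushBot") would be more accurate in practice.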

With Lemmy, especially earlier versions, the queries were a lot more expensive, and bots hitting endpoints that triggered a heavy query (such as a post with a lot of comments) would put unwanted load on my DB server. That's when I started blocking bot crawlers much more aggressively.
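He doesn't say which web server or rule set he uses. Assuming an nginx front end, per-agent blocking is commonly done with a `map` on the User-Agent header so flagged bots are rejected before they reach an expensive endpoint (the bot list below is illustrative, not his actual ruleset):

```nginx
# Flag known crawler user agents (case-insensitive substring match).
map $http_user_agent $blocked_bot {
    default          0;
    ~*SemrushBot     1;
    ~*ImagesiftBot   1;
    ~*AhrefsBot      1;
}

server {
    listen 80;
    server_name example.org;

    # Reject flagged bots before proxying to the application/DB.
    if ($blocked_bot) {
        return 403;
    }

    ...
}
```

Returning 403 (or 444 to drop the connection outright) costs almost nothing, compared to letting the request trigger a heavy database query.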

Static sites are a lot less impactful, and I usually allow crawlers on those. I've got a different rule set for them that blocks the known AI scrapers but allows search indexers (though that distinction is slowly disappearing).
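He doesn't publish that rule set. As a sketch, the AI-scraper/search-indexer split is often expressed in a robots.txt like this (these are real published crawler user agents, but the selection is illustrative); note that robots.txt is purely advisory, which is why server-side blocking is still needed for bots that ignore it:

```text
# Refuse known AI scrapers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search indexers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```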

And by blocking search crawlers, do you disappear from search results entirely, or are you still listed but with the content itself hidden?

I block bots by default, and that prevents my pages from being indexed at all, since they can't be crawled. Searching "dubvee" (my instance name / URL) in Google returns no relevant results. I'm okay with that, lol, but some people would be appalled.

However, I can search for things I've posted from my instance if they've federated to another instance that is crawled; the link will just be to the copy on that instance.

For the few static sites I run (mostly local business sites that would otherwise only exist on Facebook), I don't enforce the bot blocking, and Google and the other search engines are able to index them normally.

[–] MCasq_qsaCJ_234@lemmy.zip 0 points 4 months ago

Thanks for the explanation; it's clear to me now.