I don't understand what stops a search engine from scraping a publicly accessible website.
robots.txt, I guess? Yes, you can just ignore it, but you shouldn't, if you develop a responsible web scraper.
Doesn’t seem legal that a robots.txt could pick and choose who scrapes. Seems like legally it would have to be all or nothing. Here’s hoping one of the search engines ignores it and makes it a legal case.
You'd probably feel differently if it were your service. Should you be able to control who scrapes your sites or should that be all or nothing?
For the record, I fucking hate what the internet is becoming. I naively believed that even if shit got cordoned off into the walled gardens that are mobile phone apps, the web would remain as open as it was. This is a terrible sign of things to come.
Actually, it currently contains this:
User-agent: *
Disallow: /
Well, that actually is a blanket ban for everyone, so something else must be at play here.
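For anyone wondering how a well-behaved scraper would read those two lines, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `is_allowed`, the crawler names, and the example URL are only illustrative assumptions; the robots.txt body is the one quoted above.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted in the comment above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    is_allowed is a hypothetical helper name, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative crawler names and URL; every one of them is refused.
    url = "https://www.reddit.com/r/technology/"
    for agent in ("Googlebot", "bingbot", "MyScraper"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing here enforces anything: the file is only a request, and the check runs entirely on the scraper's side.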
https://merj.com/blog/investigating-reddits-robots-txt-cloaking-strategy
Reddit is serving a different file to Google.
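To illustrate what "serving a different file" could look like, here is a rough sketch. It is not Reddit's actual implementation, which the linked article investigates. Both file bodies and the function names `select_robots_txt` and `crawler_may_fetch` are hypothetical, and keying on the User-Agent string is an assumption made for simplicity; a real server would more likely verify the crawler some other way, since that header is trivial to spoof.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file bodies, for illustration only.
ROBOTS_FOR_EVERYONE = "User-agent: *\nDisallow: /\n"
ROBOTS_FOR_PARTNER = "User-agent: *\nAllow: /\n"


def select_robots_txt(user_agent: str) -> str:
    """Hypothetical server-side choice of which robots.txt body to send."""
    if "googlebot" in user_agent.lower():
        return ROBOTS_FOR_PARTNER
    return ROBOTS_FOR_EVERYONE


def crawler_may_fetch(user_agent: str, url: str) -> bool:
    """What a polite crawler concludes from the file it was handed."""
    parser = RobotFileParser()
    parser.parse(select_robots_txt(user_agent).splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.reddit.com/r/technology/"
    for agent in ("Googlebot/2.1", "bingbot/2.0"):
        print(agent, crawler_may_fetch(agent, url))
```

Under these assumptions, anyone checking the file from a normal browser sees the blanket ban, while the favored crawler sees permission to index everything.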
We believe in the open internet, but we do not believe in the misuse of public content.
That's real rich, coming from Reddit.
Also, rate limiting. A website being publicly accessible doesn't mean it will let scrapers read millions of pages each week. Sites can easily identify and block scrapers from the pattern of their activity. I don't know if Reddit has rate limiting, but I wouldn't be surprised if they implemented it.
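As a rough idea of what rate limiting means in practice, here is a minimal sliding-window sketch. It is not how Reddit or any particular site does it; the class name `SlidingWindowLimiter`, the limits, and the use of an IP address as the client key are all assumptions for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds.

    Hypothetical illustration; real sites combine this with other signals.
    """

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the request would be refused
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
    # A scraper hammering the site once per second from one address.
    for second in range(5):
        print(second, limiter.allow("203.0.113.7", now=float(second)))
```

A person clicking around rarely hits a limit like this, but a crawler pulling pages back to back runs into it almost immediately, which is the activity pattern mentioned above.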
Still couldn't get me to use it. I use DDG, which can switch between search engines and search sites very quickly with its ! syntax (everyone goes on about privacy, but this is pretty much its best feature). Google results are consistently the worst for me if I'm hitting multiple search engines.
Reddit really fucked themselves. Not as much as Elon fucked Twitter, but super close.
Also pretty sure DDG uses Bing
Fuck You Reddit
If you use Bing, DuckDuckGo, Mojeek, Qwant or any other alternative search engine that doesn’t rely on Google’s indexing and search Reddit by using “site:reddit.com,” you will not see any results from the last week.
That's absolutely insane... Reddit truly is making things awful. The "just add reddit" or "just add site:reddit.com" trick has been trash for a while because they bombard you with the "pwease use the app" prompt and don't show more than like three comments at a time. It's useless.
Meh, fuck em. The tighter they make their circle the less useful it is.
Reminder that Kagi searches Lemmy which is great.
I searched for a specific post: no result from DDG, correct result from Google. I will try again in a day.
https://duckduckgo.com/?q=%22advise+for+bringing+my+son+into+the+fold%22+reddit&t=fpas&ia=web
https://www.google.com/search?hl=en&q=%22advise%20for%20bringing%20my%20son%20into%20the%20fold%22%20reddit
To be fair, Reddit hasn't been that good a source for answers in recent years.
Quality drop in comments is insane. Sometimes it looks like Quora.
I was looking for Bluetooth speakers recommendations and it's the first time I really noticed "generic bot replies" like "I've got this great product to recommend, not only is it good but it offers great sound quality as well! The product is [link to Amazon page]"
Gotta start searching using "before:" to get quality results...
Oh cool! My searching won't be spammed with Reddit now!
Poople
Another nail.
Every time I click a Reddit link now it’s just “download the app to verify your age”, regardless of what it is.
I feel your pain.
I edit the URL to remove the first part of the URL and replace it with "http://old.reddit.com". That still seems to work, last I checked, but I fully expect it to be killed any day now.
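If you'd rather not do that by hand, the same edit can be sketched in a few lines of Python. The function name `to_old_reddit` and the example link are made up for illustration, and this assumes plain links with no port or login info in the host part.

```python
from urllib.parse import urlsplit


def to_old_reddit(url: str) -> str:
    """Swap the host of a reddit.com link for old.reddit.com.

    Hypothetical helper; links to any other site are returned unchanged.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == "reddit.com" or host.endswith(".reddit.com"):
        # Keep the scheme, path, query and fragment; replace only the host.
        return parts._replace(netloc="old.reddit.com").geturl()
    return url


if __name__ == "__main__":
    # Made-up example link.
    print(to_old_reddit("https://www.reddit.com/r/technology/comments/abc123/?sort=top"))
```

Like the manual trick, it only works for as long as old.reddit.com stays up.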