this post was submitted on 07 Feb 2025
272 points (91.2% liked)

[–] Treczoks@lemmy.world 11 points 13 hours ago* (last edited 3 hours ago)

wrecks the thing I care most about: copying and pasting details that I need to write articles. Instead, I often get garbled, shortened pieces of other parts of the document intermingled with the text I want—assuming I can even select it in the first place.

Two things cause this: PDF optimisation and document obfuscation.

The optimisation issue is something I've seen with many Asian PDFs. If the authors want to use a non-standard font and want the document to actually display in it, they have to embed the font into the PDF, potentially blowing up the file size. In comes the optimiser: it checks which of the thousands of glyphs of that Asian language are actually used in the document and creates a new font with only those glyphs. This font has a totally different numbering scheme from the original font, so the optimiser also replaces the numbers in the document that represent those glyphs. Result: a much smaller PDF. It looks the same, it prints the same. You can still "copy" the characters, but as their only meaning is tied to the internal representation of the font, you cannot paste them into e.g. Google Translate. It's just gibberish.

Example: The text is "Jack and Jill", and the numbers in the document representing those characters would be the ASCII/Unicode codes 74 97 99 107 32 97 110 100 32 74 105 108 108 (74 being 'J', 97 being 'a', etc.). This is standard and works basically everywhere. The optimiser sees the characters " Jacdikln" (sorted) and assigns them numbers starting with e.g. 0 for " " (space), 1 for "J", etc. The images for all other characters are thrown away, as they are not needed. The internal numbers for the text are now 1 2 3 6 0 2 8 4 0 1 5 7 7, which are not standard ASCII/Unicode, so copying them to another application just produces garbage.
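
To make the renumbering concrete, here is a minimal Python sketch of the idea. It is not how any particular PDF optimiser is implemented; the function name `subset_font` is made up for illustration, and a real font subset deals with glyph outlines and font tables rather than plain characters.

```python
def subset_font(text):
    """Hypothetical helper: keep only the characters used and renumber them from 0."""
    used = sorted(set(text))                        # the characters the "new font" keeps
    code_for = {ch: i for i, ch in enumerate(used)}
    return used, [code_for[ch] for ch in text]


text = "Jack and Jill"

# Standard ASCII/Unicode codes, understood by basically every application.
standard = [ord(ch) for ch in text]

# Internal codes after subsetting; they only make sense together with `glyphs`.
glyphs, internal = subset_font(text)

print(standard)          # [74, 97, 99, 107, 32, 97, 110, 100, 32, 74, 105, 108, 108]
print(repr("".join(glyphs)))  # ' Jacdikln'
print(internal)          # [1, 2, 3, 6, 0, 2, 8, 4, 0, 1, 5, 7, 7]

# An application that treats the internal codes as Unicode gets control characters.
naive_copy = "".join(chr(n) for n in internal)
print(repr(naive_copy))

# With the lookup table the original text can still be recovered.
recovered = "".join(glyphs[n] for n in internal)
print(recovered)         # Jack and Jill
```

The last step is the important one: the text is only recoverable if the file still carries a table mapping the internal numbers back to real characters. PDF has a mechanism for this (the ToUnicode map of a font); when it is missing or wrong, copy and paste produces the gibberish described above.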

The obfuscation is often done by putting additional text in the background color behind the main text. You cannot see it and it does not show up in prints, but when you select a piece of text, it gets copied along whether you like it or not.

So you see "Jack and Jill" in black, but behind it is "went up the hill" in white, and you copy something like "Jacwentk upandth hiell".
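
A rough Python sketch of why the hidden text ends up mixed into the selection, under a simplifying assumption: the copy routine just collects every character in the selected area in position order and never looks at its color. The `Glyph` class, the helper names and the positions are all invented for illustration; real PDF viewers use more involved text extraction.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    x: float     # horizontal position on the line (made-up units)
    char: str
    color: str   # fill color of the character


def place(text, color, start, step):
    """Lay out `text` left to right, one character every `step` units."""
    return [Glyph(start + i * step, ch, color) for i, ch in enumerate(text)]


def copy_selection(glyphs):
    """Naive copy: order by position, ignore color entirely."""
    return "".join(g.char for g in sorted(glyphs, key=lambda g: g.x))


def visible_only(glyphs, background="white"):
    """What the reader actually sees: skip text drawn in the background color."""
    ordered = sorted(glyphs, key=lambda g: g.x)
    return "".join(g.char for g in ordered if g.color != background)


# Black main text, plus white text squeezed into the same stretch of the line.
line = place("Jack and Jill", "black", 0.0, 1.0) + \
       place("went up the hill", "white", 0.3, 0.8)

copied = copy_selection(line)
print(visible_only(line))  # Jack and Jill
print(copied)              # both texts interleaved into one garbled string
```

Filtering by color, as `visible_only` does, is the obvious countermeasure in this toy model, though an obfuscator has other ways to hide text that such a filter would not catch.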