I really hope they die soon; this is unbearable…

  • ohshit604@sh.itjust.works · 5 hours ago (edited)

    For a while my GoAccess instance wasn’t working properly, so I couldn’t visualize my access logs from Traefik. I got lazy trying to fix it and left it as is. In the meantime, though, I wasn’t too lazy to set up Synapse and begin federating on my home network.

    Finally fixed my GoAccess today, only to be surprised to see Synapse hits labelled as crawlers: well over a million of them.

  • sudoer777@lemmy.ml · 12 hours ago (edited)

    I’m okay with a few crawlers, but not what’s effectively a DDoS attack by AI companies that abuse my resources, generating terabytes of traffic and crashing my server while costing me money. I use Anubis now, which sucks from an accessibility standpoint, but I’m not dealing with their malicious traffic anymore.

    • antrosapien@lemmy.ml · 10 hours ago

      I’ve wasted about a week, spread over a few months, trying to set up Anubis in front of Pangolin with Traefik, without any success. Starting from scratch every time.

    • hoppolito@mander.xyz · 1 day ago

      I ended up adding go-away in front of my code forge and anything showing dynamic info, and it turned out to be way less of a hassle than I feared: just two redirects and a couple of custom rules.

      If you already have Traefik redirecting to your services, it shouldn’t be too tough to add the extra layer of indirection (even more so if it’s containerized).
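For reference, that containerized layering can be sketched roughly like the compose fragment below. Everything here is illustrative only: the image references, the router name, and the TARGET_UPSTREAM variable are made-up placeholders (check go-away’s own docs for its real image and options). The shape is what matters: Traefik routes the public hostname to the challenge proxy, and only the proxy can reach the real service.

```yaml
services:
  traefik:
    image: traefik:v3
    # ... existing Traefik config (entrypoints, docker provider, etc.) ...

  go-away:
    # image reference and env var are placeholders; see go-away's docs
    image: example.registry/go-away:latest
    labels:
      # Traefik sends the public hostname to go-away first...
      - "traefik.http.routers.forge.rule=Host(`git.example.com`)"
      - "traefik.http.services.go-away.loadbalancer.server.port=8080"
    environment:
      # ...and go-away forwards verified clients to the real service
      - TARGET_UPSTREAM=http://forge:3000   # hypothetical setting name

  forge:
    image: example.registry/my-forge:latest
    # no published ports: only reachable through go-away on the
    # internal compose network
```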

  • punrca@piefed.world · 1 day ago

    It’s best to use either Cloudflare (best IMO) or Anubis.

    1. If you don’t want any AI bots, you can set up Anubis (open source; requires the end user to have JavaScript enabled): https://github.com/TecharoHQ/anubis

    2. Cloudflare automatically sets up a robots.txt file to block “AI crawlers” (but you can configure it to allow “AI search” for better SEO). E.g.: https://blog.cloudflare.com/control-content-use-for-ai-training/#putting-up-a-guardrail-with-cloudflares-managed-robots-txt

    Cloudflare also has an “AI Labyrinth” option that serves a maze of fake data to AI bots that don’t respect the robots.txt file.
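For anyone hand-rolling it instead of using Cloudflare’s managed version, a minimal robots.txt in the same spirit looks like this. The user agents below are a small sample of real, documented AI crawlers; the full list changes constantly, so treat it as a starting point, not a complete set:

```text
# Block common AI training crawlers (not exhaustive; list changes often)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everything else (including regular search indexing) stays allowed
User-agent: *
Disallow:
```

Remember this is purely advisory: it only works against crawlers that choose to respect it.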

    • shane@feddit.nl · 1 day ago

      If you’re relying on Cloudflare are you even self-hosting?

    • AHemlocksLie@lemmy.zip · 1 day ago

      Pretty sure I’ve repeatedly heard about the crawlers completely ignoring robots.txt, so does Cloudflare really do that much?

      • tomjuggler@lemmy.world · 4 hours ago

        Yes, Cloudflare blocks agents completely if they ignore its restrictions. The key is scale: Cloudflare has a bird’s-eye view of traffic patterns across millions of sites and can do statistical analysis to determine who is a bot.

        I hate the necessity, but it works.

      • Sv443@sh.itjust.works · 24 hours ago

        Like a lock on a door, it stops the vast majority but can’t do shit about the actual professional bad guys.

  • early_riser@lemmy.world · 2 days ago

    It’s already hard enough for self-hosters and small online communities to deal with spam from fleshbags, and now we’re being swarmed by clankers. I have a little MediaWiki to document my deranged maladaptive-daydreaming worldbuilding and conlanging projects, and the only traffic besides me is likely AI crawlers.

    I hate this so much. It’s not enough that huge centralized platforms have the network effect on their side; they have to drown our quiet little corners of the web under an overwhelming flood of soulless automata.

    • wonderingwanderer@sopuli.xyz · 2 days ago

      Anubis is supposed to filter out and block all those bots from accessing your webpage.

      Iocaine, nepenthes, and/or madore’s book of infinity are intended to redirect them into a maze of randomly generated bullshit, which still consumes resources but poisons the bots’ training data.

      So pick your poison.
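As a toy illustration of the tarpit idea (this is not the actual code of any of those projects, just a sketch of the core trick): serve cheap, deterministic garbage for any path, plus links that lead ever deeper into the maze, so a crawler burns its own resources while collecting worthless text.

```python
import hashlib
import random

# Tiny vocabulary for generating worthless-but-plausible page text
WORDS = ["quantum", "herring", "ontology", "flux", "gerund",
         "parsnip", "axiom", "bellows", "cipher", "mulch"]

def garbage_page(path: str, n_words: int = 50, n_links: int = 5) -> str:
    """Deterministic babble for a given URL path, so a crawler that
    revisits sees a stable (but worthless) page, plus links deeper
    into the maze to keep it busy."""
    # Seed the RNG from the path so output is reproducible per URL
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    text = " ".join(rng.choice(WORDS) for _ in range(n_words))
    links = [f"/maze/{rng.getrandbits(32):08x}" for _ in range(n_links)]
    return text + "\n" + "\n".join(links)

page = garbage_page("/maze/start")
```

Generating a page is just string shuffling, so it costs the server almost nothing compared to the real dynamic content the crawler would otherwise be hammering.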

      • MonkeMischief@lemmy.today · 1 day ago

        Iocaine, nepenthes, and/or madore’s book of infinity are intended to redirect them into a maze of randomly generated bullshit

        We’ve officially reached a place where cyberspace is beginning to look like communing with the arcane. Lol

        • wonderingwanderer@sopuli.xyz · 22 hours ago

          I wonder if someone techy could turn the Sworn Book of Honorius into a software program that actually summons spirits and grants powers.

          Fun fact though: Trithemius (an influential Renaissance occultist) authored the Steganographia, which provided part of the basis on which modern cryptography was built.

          • MonkeMischief@lemmy.today · 5 hours ago (edited)

            That IS a fun fact. Super cool!

            Hah, reading the introduction to this book out of curiosity…

            …And he through the council of a certain angel whose name was Hocroel, did write seven volumes of art magic, giving to us the kernel, and to others the shells.

            👀

          • MonkeMischief@lemmy.today · 6 hours ago

            Oh we’ve got bots for every vice and deadly sin now, taking after their creators.

            Kinda neat that, for now, we’ve found a way to Dr. Strange mirror-dimension them. I hope those techniques proliferate quickly.

            I don’t care what the “commercial net” does at this point. I just want the indie web to survive.

    • NewNewAugustEast@lemmy.zip · 1 day ago (edited)

      Traffic was up 10 to 20 percent month over month, then suddenly up 1000%. It has spiked hard, and they are all data harvesters.

      I know I’m going to start blocking them, which is too bad: I put valuable technical information up, with no advertising, because I want to share it. I don’t even really mind indexers, or even AI learning from it. But I cannot sustain this kind of bullshit traffic, so I’ll end up taking a heavy hand and blocking everything, and then no one will find it.

  • eli@lemmy.world · 1 day ago

    I ended up just pushing everything behind my tailnet, leaving only my game server ports open (which are non-standard ports).

  • Thorry@feddit.org · 2 days ago

    Yeah, I had the same thing. All of a sudden the load on my server was super high and I thought there was a huge issue. So I looked at the logs and saw an AI crawler absolutely slamming my server. I blocked it so it only got 403 responses, but it kept on slamming. So I blocked the IPs it was coming from in iptables; that helped a lot. My little server got about 10,000 times the normal traffic.

    I sorta get they want to index stuff, but why absolutely slam my server to death? Fucking assholes.
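This isn’t the commenter’s actual setup, but the triage step can be sketched like this, assuming a combined-format access log where the client IP is the first field. It only prints candidate iptables rules for review rather than applying anything, and the threshold is an arbitrary number to tune against your normal traffic:

```python
from collections import Counter

def block_commands(log_lines, threshold=1000):
    """Count requests per client IP (first field of a combined-format
    access log) and emit iptables DROP rules for the worst offenders."""
    hits = Counter(line.split(" ", 1)[0]
                   for line in log_lines if line.strip())
    return [f"iptables -A INPUT -s {ip} -j DROP"
            for ip, n in hits.most_common() if n >= threshold]

# Usage sketch: print the rules, eyeball them, then apply by hand:
#   for cmd in block_commands(open("access.log")):
#       print(cmd)
```

Note that blocking in iptables (rather than returning 403 from the web server) means the requests never reach the application at all, which is why it helps so much more.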

    • Ephera@lemmy.ml · 2 days ago

      My best guess is that they don’t just index things, but rather download straight from the internet when they need fresh training data. They can’t really cache the whole internet after all…

      • Techlos@lemmy.dbzer0.com · 2 days ago

        Bingo. Modern datasets are a list of URLs with metadata rather than the files themselves, so every new team or individual wanting to work with the dataset becomes another DDoS participant.
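A minimal sketch of why that multiplies load (the record fields and URLs here are made up for illustration): the dataset ships only URL records, so every consumer has to hit the live sites again to materialize it.

```python
# A URL-list dataset: metadata only, no page content.
dataset = [
    {"url": "https://example.com/page1", "lang": "en"},
    {"url": "https://example.com/page2", "lang": "en"},
]

def materialize(records, fetch):
    """Every team runs this independently against the live sites,
    so N consumers of the dataset means N full crawls of the origins."""
    return [fetch(r["url"]) for r in records]
```

With the raw files included, the origin would be hit once, ever; with URL lists, it is hit once per consumer, per refresh.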

      • Spice Hoarder@lemmy.zip · 1 day ago

        The sad thing is that they could cache the whole internet if there were a checksum protocol.

        Now that I’m thinking about it, I actually hate the idea that there are several companies out there with graph databases of the entire internet.
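A toy sketch of what such a checksum protocol could look like (purely hypothetical; HTTP’s ETag / If-None-Match mechanism is the closest real analogue): if the origin advertises a digest for a page and it matches what’s already cached, the download is skipped entirely.

```python
import hashlib

class ChecksumCache:
    """Toy content-addressable fetch cache: when the server's advertised
    checksum matches what we stored, skip the download entirely."""

    def __init__(self):
        self.store = {}    # checksum -> body
        self.by_url = {}   # url -> checksum of last fetched body

    def fetch(self, url, advertised_checksum, download):
        # Cache hit: the origin says nothing changed since last time
        if self.by_url.get(url) == advertised_checksum:
            return self.store[advertised_checksum], True
        # Cache miss: download once, index the body by its digest
        body = download(url)
        digest = hashlib.sha256(body.encode()).hexdigest()
        self.store[digest] = body
        self.by_url[url] = digest
        return body, False
```

If crawlers shared an index like this, a page would only be re-fetched when its content actually changed, instead of on every training run.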

  • e8CArkcAuLE@piefed.social · 2 days ago (edited)

    That’s the kind of shit we pollute our air and water for… and it properly seals and drives home the fuckedness of our future and planet.

    I totally get you sending them to Nepenthes, though.

  • CoreLabJoe@piefed.ca · 2 days ago

    Blocking them locally is one way, but if you’re already using Cloudflare there’s a nice way to do it UPSTREAM so it’s not eating any of your resources.

    You can do geofencing/blocking and bot-blocking via Cloudflare:
    https://corelab.tech/cloudflarept2/

  • x00z@lemmy.world · 1 day ago

    50% of my traffic is scrapers now. I really want to block them, but I also want my content to be indexed and used for LLMs. At the moment there isn’t really an in-between way of doing that. :(

    (This is with me knowing they fuck up the power grids and memory-chip supply; I’m just hoping that gets better soon.)

      • x00z@lemmy.world · 1 day ago

        I work on a project that has a lot of older, less technical, and international users who could use some extra help. We’re also not always found by the people who would benefit from our project. https://keeperfx.net/

      • lost_screwdriver@thelemmy.club · 1 day ago

        That they do not become lie machines. Propaganda, lies, and fake news from various sources get spammed all across the internet. If AI picks that up, it can just spread misinformation, especially if all the trustworthy or useful sources block them.

        • poVoq@slrpnk.net · 1 day ago

          This will just make them sound more believable when they hallucinate. LLMs can conceptually not be made to not lie, even if all the info they are trained on is 100% accurate.