Bluesky April 2026 Outage Post-Mortem

(pckt.blog)

139 pontos | por jcalabro 4 dias atrás

17 comentários

threecheese
4 dias atrás
> What I had missed is that we deployed a new internal service last week that sent less than three GetPostRecord requests per second, but it did sometimes send batches of 15-20 thousand URIs at a time. Typically, we'd probably be doing between 1-50 post lookups per request.
That’ll do it.
[-]
- 98codes
  4 dias atrás
  Ahh, the three relevant numbers in development: 0, 1, and infinity.
- jandrese
  3 dias atrás
  The incredible part about this is because their backend is all TCP/IP they were literally exhausting the ports by leaving all 65k of them in TIME_WAIT, and the workaround was to start randomizing the localhost address to give them another trillion ports or so.
  [-]
  - kyledrake
    3 dias atrás
    This is a pretty interesting solution. I could see how this could useful for certain kinds of problems (as part of a ddos attack mitigation for example).
    [-]
    - verdverm
      3 dias atrás
      An oldie and a goodie
      https://news.ycombinator.com/item?id=21865715
  - Night_Thastus
    3 dias atrás
    I mean, it's one GetPostRecord, Michael. What could it cost? 1 trillion ports?
- bombcar
  4 dias atrás
  Zero, one, many, many thousands.
- LoganDark
  3 dias atrás
  And then they fix the issue by using multiple localhost IPs rather than, perhaps, not sending 15-20 thousand URIs at a time
  [-]
  - odo1242
    3 dias atrás
    They mentioned it was a temporary fix that they removed after finding and fixing the true root cause, though.
- htx80nerd
  3 dias atrás
  less than ideal if I had to be frank.
pembrook
3 dias atrás
Distributed social media goes down? hrmmm.
Email and the internet don't have "downtime." Certain key infra providers do of course. ISPs can go down. DNS providers can go down. But the internet and email itself can't go down absent a global electricity outage.
You haven't built a decentralized network until you reach that standard imo. Otherwise its just "distributed protocol" cosplay. Nice costume. Kind of like how everybody has been amnesia'd into thinking Obsidian is open source when it really isn't.
[-]
- iAMkenough
  3 dias atrás
  Bluesky is a provider. Blacksky didn’t go down.
  [-]
  - pembrook
    3 dias atrás
    Is there anything running on Blacksky other than Bluesky with more than say, 100 active users?
    AOL never even got to that level of dominance in the internet 1.0 era.
    The point is it's not a distributed network if one node is 99.9% of all traffic.
tapoxi
3 dias atrás
I don't really understand this architecture, but I thought Bluesky was distributed like Mastodon? How can it have an outage?
[-]
- pfraze
  3 dias atrás
  This writeup is useful for backend engineers: https://atproto.com/articles/atproto-for-distsys-engineers
  The simple answer is that atproto works like the web & search engines, where the apps aggregate from the distributed accounts. So the proper analogy here would be like yahoo going down in 1999.
  [-]
  - tapoxi
    3 dias atrás
    This is a fantastic write-up, thanks for sharing!
  - fiatjaf
    3 dias atrás
    Sorry, but this analogy is very misleading, no one browses websites through Google's servers.
    For example, right now in my URL bar I read "news.ycombinator.com", not "google.com/profile/news.ycombinator.com".
    If Google goes down now I can keep browsing this website and all the other websites I have in all my other tabs as if nothing had happened.
    [-]
    - danabramov
      3 dias atrás
      Does Google Reader help you make sense of it? It’s more like each app is like its own Google Reader. And indeed you were able to access the same posts via other apps at that time of outage.
    - pfraze
      3 dias atrás
      Technically you can still view the posts directly from the PDS. It’s just uninteresting compared to web pages
    - _heimdall
      3 dias atrás
      > no one browses websites through Google's servers.
      Didn't Google's AMP project do exactly that?
    - evbogue
      3 dias atrás
      Do you have ideas about how Bluesky could decentralize?
      [-]
      - wmf
        3 dias atrás
        Not the original poster but I do have some ideas. Official Bluesky clients could randomly/round-robin access 3-4 different appview servers run by different organizations instead of one centralized server. Likewise there could be 3-4 relays instead of one. Upgrades could roll across the servers so they don't all get hit by bugs immediately.
        [-]
        evbogue
        3 dias atrás
        If multiple personal data servers (pdses) share the same set of posts how would we guarantee that they are tamper resistant to third parties?
        [-]
        wmf
        3 dias atrás
        PDSes should be sharded not replicated. Your posts live on your PDS which lives in one place (although it can move).
        [-]
        evbogue
        3 dias atrás
        What's stopping us from doing both?
        [-]
        wmf
        3 dias atrás
        Cost and complexity tradeoffs. IMO the relay/appview is the current bottleneck.
        [-]
        evbogue
        3 dias atrás
        This is why I'm hoping fiatjaf has a recommendation here. I have a feeling he might have a proposal that solves this. But doesn't solve all of it, just some of it.
  - isodev
    3 dias atrás
    Google and MSN Search were already available at this time. Also websites used to publish webrings and there was IRC and forums to ask people about things.
- isodev
  3 dias atrás
  It’s more of a concept of a plan for being distributed. I even went through the trouble of hosting my own PDC and still, I was unable to use the service during the outage
- direwolf20
  3 dias atrás
  It's not really distributed. It's a centralised service that pulls some parts of 0.01% of user profiles from their own servers.
- Retr0id
  3 dias atrás
  Mastodon infra can have outages, too.
  [-]
  - tapoxi
    3 dias atrás
    It's just confined to one instance if it goes down, not all of Mastodon.
- chr15m
  3 dias atrás
  "decentralized"
- LoganDark
  3 dias atrás
  A web interface and home server can have an outage. Bluesky is just a web interface and home server.
opem
3 dias atrás
At least they aren't hiding and transparent about it unlike the big tech corps with so called SLAs
[-]
- tmpz22
  3 dias atrás
  There are no outages in Azure sing se.
  [-]
  - thedrexster
    1 dia atrás
    i see you, brother! <3
  - _heimdall
    3 dias atrás
    GitHub's Ops team would approve this message, I assume.
mwkaufma
3 dias atrás
Tell us more about this buggy "new internal service" that's scraping batch data :P
heliumtera
3 dias atrás
Good to know the discussion about decentralization and federation had finally ended
goekjclo
4 dias atrás
> The timing of these log spikes lined up with drops in user-facing traffic, which makes sense. Our data plane heavily uses memcached to keep load off our main Scylla database, and if we're exhausting ports, that's a huge problem.
I expect this is common.
jonstaab
3 dias atrás
nostr never goes down
[-]
- jandrese
  3 dias atrás
  If nostr went down would people even notice?
  [-]
  - yangm97
    2 dias atrás
    The good thing about nostr is that, contrary to popular federated protocols, your identity is not tied to any single server, you own the keypair to your account, so you can continue using it just fine even if some relays experience a downtime.
    [-]
    - ramblejam
      1 dia atrás
      until one day when a quantum computer comes along, then goodbye your control of your nostr identity in both the future and the past!
  - nout
    3 dias atrás
    If any major nostr relay goes down, no one notices. That has happened many times, the network is very resilient to that.
  - jonstaab
    3 dias atrás
    probably not
- pfraze
  3 dias atrás
  All support to other decentralizers but nothing never goes down.
  [-]
  - nout
    3 dias atrás
    The comparison here is to something like TCP/IP. TCP/IP never goes down. TCP/IP is a protocol, the servers may go down and cause disruption, but the protocol doesn't really have the ability to "go down". Nostr is also a protocol. The communication on top of Nostr is pretty resilient compared to other solutions though, so that's the main highlight here.
    If tens of servers go down, then some people may start noticing a bit of inconvenience. If hundreds of servers go down, then some people may need to coordinate out of bound on what relays to use, but it still generally speaking works ok.
    [-]
    - arter45
      3 dias atrás
      That's because TCP/IP is a protocol, not a (centralized or decentralized) server. A protocol cannot go down. It can trigger failures, it can be abused, but it cannot go down.
      It's like saying "English never burns". Sure, you can't burn English but you can burn specific books, newspapers and so on.
      [-]
      - nout
        2 dias atrás
        That's... literally the point I just made in my reply?
  - jonstaab
    3 dias atrás
    1000x redundancy makes it vanishingly unlikely. Although I know we're due for a pole shift so all bets are off I suppose.
    [-]
    - numpad0
      3 dias atrás
      Wasn't aware there are ~2k relays now. Have inter-relay sharing situation improved?
      When I tried it long time ago, the idea was just a transposed Mastodon model that the client would just multi-post to dozen different servers(relays) automatically to be hopeful that the post would be available in at least one shared relays between the user and their followers. That didn't seem to scale well.
      [-]
      - jonstaab
        3 dias atrás
        Getting clients to do the right thing is like herding cats, but there has been some progress. Early 2023 Mike Dilger came up with the "gossip model" (renamed "outbox model" for obvious reasons). Here's my write-up: https://habla.news/hodlbod/8YjqXm4SKY-TauwjOfLXS
        The basic idea is that for microblogging use cases users advertise which relays their content is stored on, which clients follow (this implies that there are less-decentralized indexes that hold these pointers, but it does help distribute content to aligned relays instead of blast content everywhere).
        Also, relays aside, one key difference vs ActivityPub is that no third party owns your identity, which means you can move from one relay to another freely, which is not true on Mastodon.
        [-]
        numpad0
        2 dias atrás
        Thanks! Not to be critical - more like thinking out loud - and don't have solutions to following myself - but that sounds like it could 1) affect negatively to power concentrating into the top popular relays -> potentially leading to same kind of speech issues as semi-centralized ActivityPub, and 2) it won't solve need to maintain multiple firehose connections.
        I've been wondering if the multi-firehose architecture is really where decentralized censorship resistant microblogging should be the way forward; I remember the Windows Mobile clients for 2ch.net(today 5ch.io) that would scrape thread deltas from bunch of subdomains under it was plenty fast on 128k(advertised) connection to get thousands of posts in late 2000s, and so I think an RSS style of systems getting delta updates from multiple domains could work without having to do the insanity of early Nostr, or massive liabilities for instance operators with Mastodon, especially if those multiple domains could be set up with relative ease.
        Yeah, I don't exactly understand why you have to sign up every time to Mastodon servers and server operators to have to be responsible about users. It worked when it was urgently needed, which was brilliant, but the ID system had under baked spots.
        [-]
        jonstaab
        14 horas atrás
        Yeah, any time you need either an index or a caching layer you have to re-centralize one way or another. But decoupling those "services" from the data storage itself helps, and credible exit makes the gatekeepers far less powerful. An example: a few weeks ago nostr.band, one of nostr's main indexers/search services went away. Search is still somewhat impacted (evidence that we were centralized around it), but indexing (i.e. finding users' relay lists) is still covered by several other services.
      - chr15m
        3 dias atrás
        A difference with Mastodon is your account is independent of any relay.
        > scale well
        It is up and it is growing.
  - bit1993
    3 dias atrás
    Bitcoin, BitTorrent never go down.
- emidoots
  3 dias atrás
  There's stark contrast for an average human visiting the landing page of bsky.app vs nostr.org
  [-]
  - jonstaab
    3 dias atrás
    That's what decentralization looks like. You might also try:
    nostr.com nostr.how nostr.net nostrich.love nostrhub.io usenostr.org And of course https://github.com/nostr-protocol/nostr
gsibble
3 dias atrás
Did all 3 users notice?
[-]
- dogemaster2027
  3 dias atrás
  [dead]
- ffsm8
  3 dias atrás
  Naw, only one did. Turns out the other two were his socket accounts he used to upvote and comment on his own content.
  Okay, nuff trolling for today
mwagstaff
3 dias atrás
With my SRE hat on, dare I ask... could/should this have been picked up in testing?
And then normally there's a nice discussion about how production is very different to the test environment.
dogemaster2027
3 dias atrás
[dead]
templar_snow
4 dias atrás
[flagged]
[-]
- lavela
  4 dias atrás
  Why?
rvz
3 dias atrás
Thank you for the post mortem on this outage.
electrondood
3 dias atrás
Great write up... curious about the RCA. Thanks!
drewg123
3 dias atrás
Golang's use of a potentially unbounded number of threads is just insane. I used to be fairly bullish on golang, but this, combined with the fact that its garbage collected, makes me feel its just unsuitable for production use.
[-]
- floating-io
  3 dias atrás
  You can have this problem with any kind of thread -- including OS threads -- if you do an unbounded spawn loop. Go is hardly unique in this.
  Goroutines are actually better AFAIK because they distribute work on a thread pool that can be much smaller than the number of active goroutines.
  If my quick skim created a correct understanding, then the problem here looks more like architecture. Put simply: does the memcached client really require a new TCP connection for every lookup? I would think you would pool those connections just like you would a typical database and keep them around for approximately forever. Then they wouldn't have spammed memcache with so many connections in the first place...
  (edit: ah, it looks like they do use a pool, but perhaps the pool does not have a bounded upper size, which is its own kind of fail.)
  [-]
  - slopinthebag
    3 dias atrás
    Rust's async doesn't have this issue. Or at least, it's the same issue as malloc in an unbounded loop, but that's a more general issue not related to async or threading.
    15-20 thousand futures would be trivial. 15-20 thousand goroutines, definitely not.
    [-]
    - floating-io
      3 dias atrás
      I don't know enough about rust to confirm or deny that -- but unless rust somehow puts a limit on in-flight async operations, I don't see how it would help.
      The problem is not resource usage in go. The problem is that they created umpteen thousand TCP connections, which is going to kill things regardless of the language.
      [-]
      - verdverm
        3 dias atrás
        case in point, an old HN post about scaling Go to 1M websocket connections
        https://news.ycombinator.com/item?id=21865715
        [-]
        slopinthebag
        3 dias atrás
        Go can scale but only if you sidestep goroutines and use something like https://github.com/lesismal/nbio (which is awesome, highly endorse)
- tombert
  3 dias atrás
  Why does garbage collection make it unsuitable for production use? A lot of production software is written in garbage collected languages like Java. Pretty much the entire backend for iTunes/Apple Music is written in Java, and it's not doing any kind of fancy bump allocator tricks to avoid garbage. In my mind, kind of hard to argue that Apple Music is not "production use".
  There are certainly plenty of projects where garbage collection is too slow, but I don't know that they're the majority, and more people would likely prefer memory safety by default.
  [-]
  - madeofpalk
    3 dias atrás
    Based on my experience of Apple Music being pretty bad at streaming music, i would say that it's not ready for 'production use'.
    [-]
    - tombert
      3 dias atrás
      Ok, judging by this job posting [1] it looks like Spotify uses Java as well.
      [1] https://www.lifeatspotify.com/jobs/senior-backend-engineer-a...
  - slopinthebag
    3 dias atrás
    Everything is understood by comparison. Unsuitable for production use, compared to what is the more apt question.
streetfighter64
3 dias atrás
> They represent real user-facing downtime
Off-topic, but "real" feels like the new "delve". Is there such a thing as "fake" or "virtual" downtime, or why do people feel the need to specify that all manner of things are "real" nowadays?
jmclnx
4 dias atrás
Lite Blue on a dark Blue background. That is a new one, I have seen grey text on lite grey, but blue on blue ?
The article does work in lynx, at least I can read it.