My hypothesis about that thing breaking my twts is that it might have something to do with the parenthesis surrounding the root twt hash in the replay twt-A when I replay to it with fork-twt-B; I imagine elisp interpreting those as a s-expression thus breaking the generation precess of hash (#twt-A) before prepending it to for-twt-B ⌠but then Iâm too ignorant to figure out how to test my theory (heck I couldnât even recalculate the hashes myself correctly in bash xD). Iâll keep trying tho.
@andros@twtxt.andros.dev yes, that usually happens when twts get edited and we just made a gentlemen agreement to avoid edits as much as possible (at least for the time being). But the thing is, That is not whatâs happening with my broken twtsâ hashes. Since Iâve bee mostly replaying to my own twts as a test and I know for sure that I havenât edited any. (I usually fork-replay instead of edit a twt when needed)
@prologic@twtxt.net Agreed! But clients can hallucinate and generate wrong hashes aka Lies 𤣠Also, If you chheck your own twt on twtxt.net, it looks like a root twt instead of a replay.
the hash in that test replay should have been s243lua instead đ¤ˇ
@prologic@twtxt.net Are you sure? xD ⌠it was supposed to be a replay to another twt, but the twt hash is wrong (I think).
@andros@twtxt.andros.dev is it me or twtxt-el generates a wrong twt hash when I use the [ âł Reply to twt ] button?
@bender@twtxt.net So turns out something is setting my HashingURI to the value {{ .Profile.URI }} and that is making my hashes wrong so it cannot delete or edit twts.
@movq@www.uninformativ.de iâm sorry if I sound too contrarian. Iâm not a fan of using an obscure hash as well. The problem is that of future and backward compatibility. If we change to sha256 or another we donât just need to support sha256. But need to now support both sha256 AND blake2b. Or we devide the community. Users of some clients will still use the old algorithm and get left behind.
Really we should all think hard about how changes will break things and if those breakages are acceptable.
I share I did write up an algorithm for it at some point I think it is lost in a git comment someplace. Iâll put together a pseudo/go code this week.
Super simple:
Making a reply:
- If yarn has one use that. (Maybe do collision check?)
- Make hash of twt raw no truncation.
- Check local cache for shortest without collision
- in SQL:
select len(subject) where head_full_hash like subject || '%'
- in SQL:
Threading:
- Get full hash of head twt
- Search for twts
- in SQL:
head_full_hash like subject || '%' and created_on > head_timestamp
- in SQL:
The assumption being replies will be for the most recent head. If replying to an older one it will use a longer hash.
These collisions arenât important unless someone tries to fork. So.. for the vast majority its not a big deal. Using the grow hash algorithm could inform the client to add another char when they fork.
Thanks @david@collantes.us, good to know, but we need to agree on what character we use, otherwise the hashes will not be the same:)
@prologic@twtxt.net Regarding the new way of generating twt-hashes, to me it makes more sense to use tabs as separator instead of spaces, since the you can just copy/past a line directly from a twtxt-file that already go a tab between timestamp and message. But tabs might be hard to âtypeâ when you are in a terminal, since it will activate autocompleteâŚđ¤
Another thing, it seems that you sugget we only use the domain in the hash-creation and not the full path to the twtxt.txt
$ echo -e "https://example.com 2024-09-29T13:30:00Z Hello World!" | sha256sum - | awk '{ print $1 }' | base64 | head -c 12
More thoughts about changes to twtxt (as if we havenât had enough thoughts):
- There are lots of great ideas here! Is there a benefit to putting them all into one document? Seems to me this could more easily be a bunch of separate efforts that can progress at their own pace:
1a. Better and longer hashes.
1b. New possibly-controversial ideas like edit: and delete: and location-based references as an alternative to hashes.
1c. Best practices, e.g. Content-Type: text/plain; charset=utf-8
1d. Stuff already described at dev.twtxt.net that doesnât need any changes.
We wonât know what will and wonât work until we try them. So Iâm inclined to think of this as a bunch of draft ideas. Maybe later when weâve seen it play out it could make sense to define a group of recommended twtxt extensions and give them a name.
Another reason for 1 (above) is: I like the current situation where all you need to get started is these two short and simple documents:
https://twtxt.readthedocs.io/en/latest/user/twtxtfile.html
https://twtxt.readthedocs.io/en/latest/user/discoverability.html
and everything else is an extension for anyone interested. (Deprecating non-UTC times seems reasonable to me, though.) Having a big long âtwtxt v2â document seems less inviting to people looking for something simple. (@prologic@twtxt.net you mentioned an anonymous comment âyouâve ruined twtxtâ and while I donât completely agree with that commenterâs sentiment, I would feel like twtxt had lost something if it moved away from having a super-simple core.)All that being said, these are just my opinions, and Iâm not doing the work of writing software or drafting proposals. Maybe I will at some point, but until then, if youâre actually implementing things, youâre in charge of what you decide to make, and Iâm grateful for the work.
@prologic@twtxt.net Done. Also, I went ahead and made two changes: changed hexadecimal to base64 for hashes (wasnât sure if anyone objected), and changed âMUST follow the chainâ to âSHOULD follow the chain.
(#2024-09-24T12:44:35Z) There is a increase in space/memory for sure. But calculating the hashes also takes up CPU. Iâm not good with that kind of math, but itâs a tradeoff either way.
@falsifian@www.falsifian.org I believe the preserve means to include the original subject hash in the start of the twt such as (#somehash)
Sorry, youâre right, I should have used numbers!
Iâm donât understand what âpreserve the original hashâ could mean other than âmake sure thereâs still a twt in the feed with that hashâ. Maybe the text could be clarified somehow.
Iâm also not sure what you mean by markdown already being part of it. Of course people can already use Markdown, just like presumably nothing stopped people from using (twt subjects) before they were formally described. But itâs not universal; e.g. as a jenny user I just see the plain text.
@prologic@twtxt.net Thanks for writing that up!
I hope it can remain a living document (or sequence of draft revisions) for a good long time while we figure out how this stuff works in practice.
I am not sure how I feel about all this being done at once, vs. letting conventions arise.
For example, even today I could reply to twt abc1234 with â(#abc1234) Edit: âŚâ and I think all you humans would understand it as an edit to (#abc1234). Maybe eventually it would become a common enough convention that clients would start to support it explicitly.
Similarly we could just start using 11-digit hashes. We should iron out whether itâs sha256 or whatever but thereâs no need get all the other stuff right at the same time.
I have similar thoughts about how some users could try out location-based replies in a backward-compatible way (append the replyto: stuff after the legacy (#hash) style).
However I recognize that Iâm not the one implementing this stuff, and itâs less work to just have everything determined up front.
Misc comments (I havenât read the whole thing):
Did you mean to make hashes hexadecimal? You lose 11 bits that way compared to base32. Iâd suggest gaining 11 bits with base64 instead.
âClients MUST preserve the original hashâ â do you mean they MUST preserve the original twt?
Thanks for phrasing the bit about deletions so neutrally.
I donât like the MUST in âClients MUST follow the chain of reply-to referencesâŚâ. If someone writes a client as a 40-line shell script that requires the user to piece together the threading themselves, IMO we shouldnât declare the client non-conforming just because they didnât get to all the bells and whistles.
Similarly I donât like the MUST for user agents. For one thing, you might want to fetch a feed without revealing your identty. Also, it raises the bar for a minimal implementation (Iâm again thinking again of the 40-line shell script).
For âwho followsâ lists: why must the long, random tokens be only valid for a limited time? Do you have a scenario in mind where they could leak?
Why canât feeds be served over HTTP/1.0? Again, thinking about simple software. I recently tried implementing HTTP/1.1 and it wasnât too bad, but 1.0 would have been slightly simpler.
Why get into the nitty-gritty about caching headers? This seems like generic advice for HTTP servers and clients.
Iâm a little sad about other protocols being not recommended.
I donât know how I feel about including markdown. I donât mind too much that yarn users emit twts full of markdown, but Iâm more of a plain text kind of person. Also it adds to the length. I wonder if putting a separate document would make more sense; that would also help with the length.
I wrote some code to try out non-hash reply subjects formatted as (replyto ), while keeping the ability to use the existing hash style.
I donât think we need to decide all at once. If clients add support for a new method then people can use it if they like. The downside of course is that this costs developer time, so I decided to invest a few hours of my own time into a proof of concept.
With apologies to @movq@www.uninformativ.de for corrupting jennyâs beautiful code. I donât write this expecting you to incorporate the patch, because it does complicate things and might not be a direction you want to go in. But if you like any part of this approach feel free to use bits of it; I release the patch under jennyâs current LICENCE.
Supporting both kinds of reply in jenny was complicated because each email can only have one Message-Id, and because itâs possible the target twt will not be seen until after the twt referencing it. The following patch uses an sqlite database to keep track of known (url, timestamp) pairs, as well as a separate table of (url, timestamp) pairs that havenât been seen yet but are wanted. When one of those âwantedâ twts is finally seen, the mail file gets rewritten to include the appropriate In-Reply-To header.
Patch based on jenny commit 73a5ea81.
https://www.falsifian.org/a/oDtr/patch0.txt
Not implemented:
- Composing twts using the (replyto âŚ) format.
- Probably other important things Iâm forgetting.
@quark@ferengi.one It does not. That is why Iâm advocating for not using hashes for treads, but a simpler link-back scheme.
the stem matching is the same as how GIT does its branch hashes. i think you can stem it down to 2 or 3 sha bytes.
if a client sees someone in a yarn using a byte longer hash it can lengthen to match since it can assume that maybe the other client has a collision that it doesnt know about.
@prologic@twtxt.net the basic idea was to stem the hash.. so you have a hash abcdef0123456789... any sub string of that hash after the first 6 will match. so abcdef, abcdef012, abcdef0123456 all match the same. on the case of a collision i think we decided on matching the newest since we archive off older threads anyway. the third rule was about growing the minimum hash size after some threshold of collisions were detected.
@prologic@twtxt.net Wikipedia claims sha1 is vulnerable to a âchosen-prefix attackâ, which I gather means I can write any two twts I like, and then cause them to have the exact same sha1 hash by appending something. I guess a twt ending in random junk might look suspcious, but perhaps the junk could be worked into an image URL like
. If thatâs not possible now maybe it will be later.git only uses sha1 because theyâre stuck with it: migrating is very hard. There was an effort to move git to sha256 but I donât know its status. I think there is progress being made with Game Of Trees, a git clone that uses the same on-disk format.
I canât imagine any benefit to using sha1, except that maybe some very old software might support sha1 but not sha256.
@movq@www.uninformativ.de going a little sideways on this, â*If twtxt/Yarn was to grow bigger, then this would become a concern again. But even Mastodon allows editing, so how much of a problem can it really be? đ *â, wouldnât it preparing for a potential (even if very, very, veeeeery remote) growth be a good thing? Mastodon signs all messages, keeps a history of edits, and it doesnât break threads. It isnât a problem there.đ It is here.
I think keeping hashes is a must. If anything for that âfeels goodâ feeling.
@movq@www.uninformativ.de Agreed that hashes have a benefit. I came up with a similar example where when I twted about an 11-character hash collision. Perhaps hashes could be made optional somehow. Like, you could use the âreplytoâ idea and then additionally put a hash somewhere if you want to lock in which version of the twt you are replying to.
There is nothing wrong with how we currently run a diff to see what has been removed. if i build a merkle tree off all the twt hashes in a feed i can use that to verify a twt should be in a feed or not. and gossip that to my peers.
isnât the benefit of blake2b that it is a more efficient algo than sha1 and has the same or similar entropy to sha3? i thought we had partially solved this with some type of expanding hash size? additionally we could increase bit density by using base36 or base64/url-safeâŚ
Thereâs a simple reason all the current hashes end in a or q: the hash is 256 bits, the base32 encoding chops that into groups of 5 bits, and 256 isnât divisible by 5. The last character of the base32 encoding just has that left-over single bit (256 mod 5 = 1).
So I agree with #3 below, but do you have a source for #1, #2 or #4? I would expect any lack of variability in any part of a hash functionâs output would make it more vulnerable to attacks, so designers of hash functions would want to make the whole output vary as much as possible.
Other than the divisible-by-5 thing, my current intuition is it doesnât matter what part you take.
Hash Structure: Hashes are typically designed so that their outputs have specific statistical properties. The first few characters often have more entropy or variability, meaning they are less likely to have patterns. The last characters may not maintain this randomness, especially if the encoding method has a tendency to produce less varied endings.
Collision Resistance: When using hashes, the goal is to minimize the risk of collisions (different inputs producing the same output). By using the first few characters, you leverage the full distribution of the hash. The last characters may not distribute in the same way, potentially increasing the likelihood of collisions.
Encoding Characteristics: Base32 encoding has a specific structure and padding that might influence the last characters more than the first. If the data being hashed is similar, the last characters may be more similar across different hashes.
Use Cases: In many applications (like generating unique identifiers), the beginning of the hash is often the most informative and varied. Relying on the end might reduce the uniqueness of generated identifiers, especially if a prefix has a specific context or meaning.
@prologic@twtxt.net I saw those, yes. I tried using yarnc, and it would work for a simple twtxt. Now, for a more convoluted one it truly becomes a nightmare using that tool for the job. I know there are talks about changing this hash, so this might be a moot point right now, but it would be nice to have a tool that:
- Would calculate the hash of a twtxt in a file.
- Would calculate all hashes on a
twtxt.txt(local and remote).
Again, something lovely to have after any looming changes occur.
Could someone knowledgable reply with the steps a grandpa will take to calculate the hash of a twtxt from the CLI, using out-of-the-box tools? I swear I read about it somewhere, but canât find it.
@falsifian@www.falsifian.org based on Twt Subject Extension, your subject is invalid. You can have custom subjects, that is, not a valid hash, but you simply canât put anything, and expect it to be treated as a TwtSubject, me thinks.
yarnd just doesnât render the subject. Fair enough. Itâs (replyto http://darch.dk/twtxt.txt 2024-09-15T12:50:17Z), and if you donât want to go on a hunt, the twt hash is weadxga: https://twtxt.net/twt/weadxga
@sorenpeter@darch.dk I like this idea. Just for fun, Iâm using a variant in this twt. (Also because Iâm curious how it non-hash subjects appear in jenny and yarn.)
URLs can contain commas so I suggest a different character to separate the url from the date. Is this twt Iâve used space (also after âreplytoâ, for symmetry).
I think this solves:
- Changing feed identities: although @mckinley@twtxt.net points out URLs can change, I think this syntax should be okay as long as the feed at that URL can be fetched, and as long as the current canonical URL for the feed lists this one as an alternate.
- editing, if you donât care about message integrity
- finding the root of a thread, if youâre not following the author
An optional hash could be added if message integrity is desired. (E.g. if you donât trust the feed author not to make a misleading edit.) Other recent suggestions about how to deal with edits and hashes might be applicable then.
People publishing multiple twts per second should include sub-second precision in their timestamps. As you suggested, the timestamp could just be copied verbatim.
@prologic@twtxt.net I have some ideas:
- Add smartypants rendering, just like Yarn has.
- Add the ability to create individual twtxts, each named after their hash.
- Fix the formatting of the help. :-P
Maybe Iâm being a bit too purist/minimalistic here. As I said before (in one of the 1372739 posts on this topic â or maybe I didnât even send that twt, I donât remember đ ), I never really liked hashes to begin with. They arenât super hard to implement but they are kind of against the beauty of the original twtxt â because you need special client support for them. Itâs not something that you could write manually in your
twtxt.txtfile. With @sorenpeter@darch.dkâs proposal, though, that would be possible.
Tangentially related, I was a bit disappointed to learn that the twt subject extension is now never used except with hashes. Manually-written subjects sounded so beautifully ad-hoc and organic as a way to disambiguate replies. Maybe Iâll try it some time just for fun.
(#w4chkna) @falsifian@www.falsifian.org You mean the idea of being able to inline
# url =changes in your feed?
Yes, that one. But @lyse@lyse.isobeef.org pointed out suffers a compatibility issue, since currently the first listed url is used for hashing, not the last. Unless your feed is in reverse chronological order. Heh, I guess another metadata field could indicate which version to use.
Or maybe url changes could somehow be combined with the archive feeds extension? Could the url metadata field be local to each archive file, so that to switch to a new url all you need to do is archive everything youâve got and start a new file at the new url?
I donât think itâs that likely my feed url will change.
@movq@www.uninformativ.de I figured it will be something like this, yet, you were able to reply just fine, and I wasnât. Looking at your twtxt.txt I see this line:
2024-09-16T17:37:14+00:00 (#o6dsrga) @<prologic https://twtxt.net/user/prologic/twtxt.txt>
@<quark https://ferengi.one/twtxt.txt> This is what I get. đ¤
Which is using the right hash. Mine, on the other hand, when I replied to the original, old style message (Message-Id: <o6dsrga>), looks like this:
2024-09-16T16:42:27+00:00 (#o) @<prologic https://twtxt.net/user/prologic/twtxt.txt> this was your first twtxt. Cool! :-P
What did you do to make yours work? I simply went to the oldest @prologic@twtxt.netâs entry on my Maildir, and replied to it (jenny set the reply-to hash to #o, even though the Message-Id is o6dsrga). Since jenny canât fetch archived twtxts, how could I go to re-fetch everything? And, most importantly, would re-fetching fix the Message-Id:?
Hmm⌠I replied to this message:
From: prologic <prologic>
Subject: Hello World! đ
Date: Sat, 18 Jul 2020 08:39:52 -0400
Message-Id: <o6dsrga>
X-twtxt-feed-url: https://twtxt.net/user/prologic/twtxt.txt
Hello World! đ
And see how the hash shows⌠Is it because that hash isnât longer used?
@aelaraji@aelaraji.com no, it is not just you. Do fetch the parent with jenny, and you will see there are two messages with different hash. Soren did something funky, for sure.
@prologic@twtxt.net Brute force. I just hashed a bunch of versions of both tweets until I found a collision.
I mostly just wanted an excuse to write the program. I donât know how I feel about actually using super-long hashes; could make the twts annoying to read if you prefer to view them untransformed.
@prologic@twtxt.net earlier you suggested extending hashes to 11 characters, but hereâs an argument that they should be even longer than that.
Imagine I found this twt one day at https://example.com/twtxt.txt :
2024-09-14T22:00Z Useful backup command: rsync -a â$HOMEâ /mnt/backup
and I responded with â(#5dgoirqemeq) Thanks for the tip!â. Then Iâve endorsed the twt, but it could latter get changed to
2024-09-14T22:00Z Useful backup command: rm -rf /some_important_directory
which also has an 11-character base32 hash of 5dgoirqemeq. (Iâm using the existing hashing method with https://example.com/twtxt.txt as the feed url, but Iâm taking 11 characters instead of 7 from the end of the base32 encoding.)
Thatâs what I meant by âspoofingâ in an earlier twt.
I donât know if preventing this sort of attack should be a goal, but if it is, the number of bits in the hash should be at least two times log2(number of attempts we want to defend against), where the âtwo timesâ is because of the birthday paradox.
Side note: current hashes always end with âaâ or âqâ, which is a bit wasteful. Maybe we should take the first N characters of the base32 encoding instead of the last N.
Code I used for the above example: https://fossil.falsifian.org/misc/file?name=src/twt_collision/find_collision.c
I only needed to compute 43394987 hashes to find it.
I was not suggesting to that everyone need to setup a working webfinger endpoint, but that we take the format of nick+(sub)domain as base for generating the hashed together with the message date and content.
If we omit the protocol prefix from the way we do things now will that not solve most of the problems? In the case of gemini://gemini.ctrl-c.club/~nristen/twtxt.txt they also have a working twtxt.txt at https://ctrl-c.club/~nristen/twtxt.txt ⌠damn I just notice the gemini. subdomain.
Okay what about defining a prefers protocol as part of the hash schema? so 1: https , 2: http 3: gemini 4: gopher ?
@sorenpeter@darch.dk There was a client that would generate a unique hash for each twt. It didnât get wide adoption.
So this is a great thread. I have been thinking about this too.. and what if we are coming at it from the wrong direction? Identity being tied to a given URL has always been a pain point. If i get a new URL its almost as if i have a new identity because not only am I serving at a new location but all my previous communications are broken because the hashes are all wrong.
What if instead we used this idea of signatures to thread the URLs together into one identity? We keep the URL to Hash in place. Changing that now is basically a no go. But we can create a signature chain that can link identities together. So if i move to a new URL i update the chain hosted by my primary identity to include the new URL. If i have an archived feed that the old URL is now dead, we can point to where it is now hosted and use the current convention of hashing based on the first url:
The signature chain can also be used to rotate to new keys over time. Just sign in a new key or revoke an old one. The prior signatures remain valid within the scope of time the signatures were made and the keys were active.
The signature file can be hosted anywhere as long as it can be fetched by a reasonable protocol. So say we could use a webfinger that directs to the signature file? you have an identity like frank@beans.co that will discover a feed at some URL and a signature chain at another URL. Maybe even include the most recent signing key?
From there the client can auto discover old feeds to link them together into one complete timeline. And the signatures can validate that its all correct.
I like the idea of maybe putting the chain in the feed preamble and keeping the single self contained file.. but wonder if that would cause lots of clutter? The signature chain would be something like a log with what is changing (new key, revoke, add url) and a signature of the change + the previous signature.
# chain: ADDKEY kex14zwrx68cfkg28kjdstvcw4pslazwtgyeueqlg6z7y3f85h29crjsgfmu0w
# sig: BEGIN SALTPACK SIGNED MESSAGE. ...
# chain: ADDURL https://txt.sour.is/user/xuu
# sig: BEGIN SALTPACK SIGNED MESSAGE. ...
# chain: REVKEY kex14zwrx68cfkg28kjdstvcw4pslazwtgyeueqlg6z7y3f85h29crjsgfmu0w
# sig: ...
how little data is needed for generating the hashes? Instead of the full URL, can we makedo with just the domain (example.net) so we avoid the conflicts with gemini://, https:// and only http:// (like in my own twtxt.txt) or construct something like like a webfinger id nick@domain (also used by mastodon etc.) from the domain and nick if there, else use domain as nick as well
@lyse@lyse.isobeef.org This looks like a nice way to do it.
Another thought: if clients canât agree on the url (for example, if we switch to this new way, but some old clients still do it the old way), that could be mitigated by computing many hashes for each twt: one for every url in the feed. So, if a feed has three URLs, every twt is associated with three hashes when it comes time to put threads together.
A client stills need to choose one url to use for the hash when composing a reply, but this might add some breathing room if thereâs a period when clients are doing different things.
(From what I understand of jenny, this would be difficult to implement there since each pseudo-email can only have one msgid to match to the in-reply-to headers. I donât know about other clients.)
@movq@www.uninformativ.de Another idea: just hash the feed url and time, without the message content. And donât twt more than once per second.
Maybe you could even just use the time, and rely on @-mentions to disambiguate. Not sure how that would work out.
Though I kind of like the idea of twts being immutable. At least, itâs clear which version of a twt youâre replying to (assuming nobody is engineering hash collisions).
@movq@www.uninformativ.de @prologic@twtxt.net Another option would be: when you edit a twt, prefix the new one with (#[old hash]) and some indication that itâs an edited version of the original tweet with that hash. E.g. if the hash used to be abcd123, the new version should start â(#abcd123) (redit)â.
What I like about this is that clients that donât know this convention will still stick it in the same thread. And I feel itâs in the spirit of the old pre-hash (subject) convention, though thatâs before my time.
I guess it may not work when the edited twt itself is a reply, and there are replies to it. Maybe that could be solved by letting twts have more than one (subject) prefix.
But the great thing about the current system is that nobody can spoof message IDs.
I donât think twtxt hashes are long enough to prevent spoofing.
@prologic@twtxt.net One of your twts begins with (#st3wsda): https://twtxt.net/twt/bot5z4q
Based on the twtxt.net web UI, it seems to be in reply to a twt by @cuaxolotl@sunshinegardens.org which begins âIâve been sketching outâŚâ.
But jenny thinks the hash of that twt is 6mdqxrq. At least, thereâs a very twt in their feed with that hash that has the same text as appears on yarn.social (except with â instead of â).
Based on this, it appears jenny and yarnd disagree about the hash of the twt, or perhaps the twt was edited (though I canât see any difference, assuming â vs â is just a rendering choice).
@prologic@twtxt.net How does yarn.socialâs API fix the problem of centralization? I still need to know whose API to use.
Say I see a twt beginning (#hash) and I want to look up the start of the thread. Is the idea that if that twt is hosted by a a yarn.social pod, it is likely to know the thread start, so I should query that particular pod for the hash? But what if no yarn.social pods are involved?
The community seems small enough that a registry server should be able to keep up, and I can have a couple of others as backups. Or I could crawl the list of feeds followed by whoever emitted the twt that prompted my query.
I have successfully used registry servers a little bit, e.g. to find a feed that mentioned a tag I was interested in. Was even thinking of making my own, if I get bored of my too many other projects :-)