Refactoring social media technology
The Internet and social media have brought profound changes to our daily lives, to how we communicate within our communities and society, and, of course, to our privacy. Unfortunately, these changes have not always been for the better. I don’t want to go into the impact of social media too much in this post; the documentary ‘The Social Dilemma’ on Netflix covers its effects in more detail. Regarding privacy, I think we all know how much it is at risk from the practices of the ad industry, other companies, and even some governments.
I want to discuss a solution to refactor two technologies that form the backbone of social media and other websites and apps:
1: Social media sites are powered by graph databases that contain information about us, about our friends, family, and colleagues, and about the videos, posts, or music we like.
2: Data lakes store information about all our interactions with the websites, apps, and devices of these companies.
What I’d like to accomplish is to pry our data out of the hands of these companies. If they no longer control our data, we are empowered to make our own decisions about what data to share and with whom to share it. Furthermore, we get to decide what data we want to keep and what to delete. For this purpose, I have developed the Byoda software. One of the components of Byoda is a variant of the personal data store. As we already have too many acronyms in our industry, I’d like to call it a data pod. You can run your own pod either in your home network or in the public cloud. You only need a single pod to store your data from all the services you use.
My assumption is that each website or app out there can create a single, consolidated, detailed, and up-to-date technical specification for the data model it uses to store data about each of us. I can’t see any website arguing against that, as it would mean admitting they don’t know what their various dev teams are doing. There would likely be frequent updates to that specification, but we can develop our technology to deal with that.
Byoda turns this technical specification into a ‘data contract’ by adding access controls: different actors, such as ‘member’ (you), ‘service,’ and ‘network,’ are permitted to perform Create, Read, Update, and Delete (CRUD) actions on the various data elements specified in the contract. The data contract tells your pod what data it will store for the service and who should have access to that data. You have to accept this data contract when you decide to join a service, and your pod enforces the contract, acting as a data firewall.
The first technical challenge is the enforcement of the data contract. I’m using JSON Schema with some additional mark-up for this. The pod parses the data contract for the service you joined, so it knows what data to expect and who should have access to that data.
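To make the idea of a contract with access controls a bit more concrete, here is a minimal sketch in Python. It is not Byoda’s actual mark-up: the ‘x-access’ key, the field names, and the is_allowed helper are all assumptions made up for this illustration. Only the actor names (‘member’, ‘service’) and the CRUD actions come from the description above.

```python
from typing import Any, Dict

# Hypothetical data contract: ordinary JSON Schema plus an "x-access"
# annotation (an assumed name, not Byoda's real mark-up) that lists the
# actions each actor may perform on a data element.
CONTRACT: Dict[str, Any] = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "given_name": {
            "type": "string",
            "x-access": {
                "member": ["create", "read", "update", "delete"],
                "service": ["read"],
            },
        },
        "email": {
            "type": "string",
            "x-access": {"member": ["create", "read", "update", "delete"]},
        },
    },
}


def is_allowed(contract: Dict[str, Any], field: str, actor: str, action: str) -> bool:
    """Return True only if the contract grants `actor` the `action` on `field`."""
    element = contract.get("properties", {}).get(field)
    if element is None:
        # Data elements that are not in the contract are always rejected.
        return False
    return action in element.get("x-access", {}).get(actor, [])
```

With this contract, the service may read your given name but may not change it, and it has no access to your email address at all. Anything the contract does not mention is denied by default, which is the behavior you would want from a firewall.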
The second technical challenge is that we need a storage technology that stores all our data from various websites in a single pod without knowing beforehand what data models the pod needs to support. Because the data contract is written in JSON Schema, the data itself is in JSON format, making it trivial to store our data for each website as one or more JSON documents. These documents can be stored in a key/value store or a NoSQL database.
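As a rough sketch of what such schema-agnostic storage could look like, the Python snippet below keeps one JSON document per service in a key/value table. SQLite is used here only because it ships with Python; the PodStore class, the table layout, and the service names are assumptions for this example, not a description of what the pod actually runs on.

```python
import json
import sqlite3
from typing import Any, Dict, Optional


class PodStore:
    """Key/value store that holds one JSON document per service."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "service_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def put(self, service_id: str, document: Dict[str, Any]) -> None:
        # The store never inspects the document's shape; it is saved as text.
        self._db.execute(
            "INSERT OR REPLACE INTO documents (service_id, body) VALUES (?, ?)",
            (service_id, json.dumps(document)),
        )
        self._db.commit()

    def get(self, service_id: str) -> Optional[Dict[str, Any]]:
        row = self._db.execute(
            "SELECT body FROM documents WHERE service_id = ?", (service_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


store = PodStore()
# Two services with completely different data models share the same pod.
store.put("photo-service", {"albums": [{"title": "Holiday", "photos": 42}]})
store.put("music-service", {"likes": ["track-1", "track-2"]})
```

The store has no idea what a photo album or a liked track is, so a new service with a new data model needs no change to the storage layer.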
The third technical challenge is how to make the data accessible when we don’t know beforehand what data will be requested from the pod. I believe REST APIs are ill-suited here, as they would either be very generic or have to be auto-generated from the specification of the website’s data model. Instead, I’ve chosen to programmatically convert the JSON Schema into a GraphQL schema. With GraphQL, developers of websites and apps can query the data for their service in our pods. I believe the GraphQL API will be sufficient for accessing the data in the pod, while REST APIs are used to manage the pod itself.
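To give a feel for the JSON Schema to GraphQL conversion, here is a simplified sketch that emits GraphQL type definitions as text. It only handles scalars, arrays, and nested objects, and the to_graphql function, the type-naming rule, and the example schema are assumptions for this illustration; the real conversion has to cover much more of JSON Schema.

```python
from typing import Any, Dict, List

# JSON Schema primitive types mapped to GraphQL scalar types.
SCALARS = {"string": "String", "integer": "Int", "number": "Float", "boolean": "Boolean"}


def to_graphql(root_name: str, schema: Dict[str, Any]) -> str:
    """Convert a JSON Schema object definition into GraphQL type definitions."""
    types: List[str] = []

    def field_type(parent: str, field_name: str, spec: Dict[str, Any]) -> str:
        kind = spec.get("type")
        if kind == "array":
            return "[" + field_type(parent, field_name, spec["items"]) + "]"
        if kind == "object":
            # Nested objects become their own type, named after parent and field.
            nested = parent + field_name.title().replace("_", "")
            build(nested, spec)
            return nested
        return SCALARS[kind]  # raises KeyError for types this sketch ignores

    def build(type_name: str, spec: Dict[str, Any]) -> None:
        required = set(spec.get("required", []))
        lines = []
        for field_name, field_spec in spec.get("properties", {}).items():
            suffix = "!" if field_name in required else ""
            gql_type = field_type(type_name, field_name, field_spec)
            lines.append(f"  {field_name}: {gql_type}{suffix}")
        types.append(f"type {type_name} {{\n" + "\n".join(lines) + "\n}")

    build(root_name, schema)
    return "\n\n".join(types)


# Hypothetical data model for a service.
PERSON = {
    "type": "object",
    "required": ["given_name"],
    "properties": {
        "given_name": {"type": "string"},
        "age": {"type": "integer"},
        "friends": {
            "type": "array",
            "items": {"type": "object", "properties": {"member_id": {"type": "string"}}},
        },
    },
}

sdl = to_graphql("Person", PERSON)
```

For this schema, sdl contains a Person type with the fields given_name: String!, age: Int, and friends: [PersonFriends], plus a PersonFriends type for the nested object. Because the conversion is mechanical, the pod can regenerate its GraphQL API whenever a service updates its data contract.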
The JSON document does not just store information about you; it can also store information about your network relations and event logs. This allows the collection of all running data pods to act as both a graph database and a data lake. One of the permissions that we support in the JSON Schema is ‘append,’ which allows the service to store event data in your pod. Another permission is ‘search,’ which allows filtering of the data stored in your pod. Together, these two permissions should provide sufficient functionality to replace the centralized graph and data lake platforms.
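The ‘append’ and ‘search’ permissions could behave roughly like the Python sketch below. The EventLog class, the exact-match filtering, and the event fields are assumptions for this example; only the two permission names come from the text above.

```python
from typing import Any, Dict, List, Set


class EventLog:
    """Event list in a pod, guarded by 'append' and 'search' permissions."""

    def __init__(self, permissions: Dict[str, Set[str]]) -> None:
        self._permissions = permissions
        self._events: List[Dict[str, Any]] = []

    def _check(self, actor: str, action: str) -> None:
        if action not in self._permissions.get(actor, set()):
            raise PermissionError(f"{actor} may not {action}")

    def append(self, actor: str, event: Dict[str, Any]) -> None:
        # 'append' lets an actor add events without being able to edit old ones.
        self._check(actor, "append")
        self._events.append(event)

    def search(self, actor: str, **filters: Any) -> List[Dict[str, Any]]:
        # 'search' returns only the events whose fields match every filter.
        self._check(actor, "search")
        return [
            event
            for event in self._events
            if all(event.get(key) == value for key, value in filters.items())
        ]


log = EventLog({"service": {"append", "search"}, "network": set()})
log.append("service", {"type": "view", "video": "v1"})
log.append("service", {"type": "like", "video": "v1"})
log.append("service", {"type": "view", "video": "v2"})
views = log.search("service", type="view")
```

Here views holds the two ‘view’ events, while an append or search by the ‘network’ actor raises a PermissionError because the contract gave it neither permission.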
One issue I haven’t designed a solution for yet is how to ‘walk the network,’ for example, how to find the names of the friends of your friends. There are a couple of possibilities:
- The service can get identifiers for your friends from your pod and then query the pods of your friends for their names. Because of the number of queries to pods needed, the latency of the request would be an order of magnitude higher compared with querying a graph database. The request would also be an order of magnitude more expensive computationally for the service, but, in exchange, the service would no longer have the cost of storing the data now stored in the pods.
- The service can do a recursive query against a pod for the pod to query other pods. This can be implemented as a GraphQL query with a ‘resolver’ function that queries other pods, or a REST API could be implemented that takes the GraphQL query to execute as input. When the pod receives the request, it could issue a similar REST API call to the pods of the friends.
For now, the first option will work. We’ll likely run into scaling issues as the number of running pods grows, but we can develop a solution before this becomes a real problem.
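The first option can be sketched in a few lines of Python. The PODS dictionary stands in for the network of pods and query_pod stands in for one GraphQL request to a pod; both, along with the member names, are placeholders invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List

# Hypothetical stand-in for the pods on the network: member id -> pod data.
PODS: Dict[str, Dict[str, Any]] = {
    "alice": {"name": "Alice", "friends": ["bob", "carol"]},
    "bob": {"name": "Bob", "friends": ["alice", "dave"]},
    "carol": {"name": "Carol", "friends": ["dave", "erin"]},
    "dave": {"name": "Dave", "friends": []},
    "erin": {"name": "Erin", "friends": []},
}


def query_pod(member_id: str) -> Dict[str, Any]:
    """Placeholder for a single GraphQL request to one member's pod."""
    return PODS[member_id]


def friends_of_friends(member_id: str) -> List[str]:
    """Return the names of second-degree friends, as the service would."""
    direct = query_pod(member_id)["friends"]
    with ThreadPoolExecutor(max_workers=8) as pool:
        # One query per direct friend, issued in parallel to limit latency.
        friend_docs = list(pool.map(query_pod, direct))
        second = {fid for doc in friend_docs for fid in doc["friends"]}
        second -= set(direct) | {member_id}
        # One more query per second-degree friend to fetch the name.
        return [doc["name"] for doc in pool.map(query_pod, sorted(second))]


names = friends_of_friends("alice")
```

For ‘alice’ this takes five pod queries (her own pod, two friends, two friends of friends) where a graph database would need one, which is the extra latency and cost described above. Running the queries in parallel helps with latency but not with the total number of requests.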
If you’d like to get started with Byoda, you can install your own data pod. The instructions on getting started are on the Byoda GitHub page.