# Writing Shims to the Internet Archive and to the dweb-transports library

Second draft: Mitra 19 Dec 2018

Our intention with the dweb-transports and dweb-archive libraries is for them to be available for integration with any decentralized platform (what we call a transport),
and this guide is intended to help with that process.

In our experience the process of adding a transport to a platform is pretty easy **when** we collaborate with someone intimately familiar with the platform.
So feel free to ask questions in the [dweb-transports](https://github.com/internetarchive/dweb-transports/issues) repo,
and to reach out to [Mitra](mailto:mitra@archive.org) for assistance.

If you are working on an integration, please add a comment to [dweb-transports issue#10](https://github.com/internetarchive/dweb-transports/issues/10).

All the repos are open source; `dweb-objects`, for example, refers to [https://github.com/internetarchive/dweb-objects](https://github.com/internetarchive/dweb-objects).
## Overview

Integrating a Dweb platform (aka transport) into this library has two main stages.

1. Integration into the [dweb-transports](https://github.com/internetarchive/dweb-transports) repo,
which mostly involves writing a file with a name like TransportXYZ.js
and integrating it in a couple of places. This can be done entirely by a third party,
though it will go more smoothly with collaboration.
2. Integrating a shim that makes the Internet Archive's content available
on the decentralized platform, either via the [dweb.archive.org](https://dweb.archive.org) UI or otherwise.
This is only necessary if you want to make IA content available,
and will require our assistance to integrate with code that runs on IA servers.
## Integration into the [dweb-transports](https://github.com/internetarchive/dweb-transports) repo

### Building TransportXYZ.js

The main code sits in a file named something like TransportXYZ.js.

This file contains implementations of:
* Chunks - storing and retrieving opaque data as a chunk or via a stream.
* KeyValues - setting and getting the value of a key in a table.
* Lists - append-only logs.

See [API.md](./API.md) and the existing code for detailed function-by-function documentation.
#### Error handling

One common problem with decentralized platforms is reliability. We handle this by falling back from one platform to another,
e.g. if IPFS fails we can try WEBTORRENT or HTTP. But this only works if the Transports.js layer can detect when a failure has occurred.
This means it is really important to return an error (via a throw, promise rejection, or callback) whenever a request fails.

#### Promises or callbacks

We've tried to support both promises and callbacks, though this isn't complete yet.
In general it will work best if each outward-facing function supports a `cb(err, res)` parameter and, where this is absent, returns a Promise that will `resolve` to `res` or `reject` with `err`.

The `p_foo()` naming convention was previously used to indicate which functions returned a Promise and is gradually being phased out.
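As a concrete sketch of this convention - illustrative only, not the actual dweb-transports API - a transport method supporting both styles might look like:

```javascript
// Hypothetical sketch of the cb(err, res)-or-Promise convention described
// above; TransportXYZ and fetch() are illustrative names, not real API.
class TransportXYZ {
  fetch(url, cb) {
    // Stand-in for platform-specific retrieval; failures must surface as a
    // rejection (never be swallowed) so Transports.js can fall back.
    const p = url
      ? Promise.resolve(`data for ${url}`)
      : Promise.reject(new Error("no url"));
    if (cb) {
      // Callback style: deliver result or error via cb(err, res)
      p.then(res => cb(null, res), err => cb(err));
    } else {
      // Promise style: resolves to res, or rejects with err
      return p;
    }
  }
}
```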
### Integration other than TransportXYZ.js

Searching dweb-transports for `SEE-OTHER-ADDTRANSPORT` should find any places in the code where a tweak is required to add a new transport.

The current list of places to integrate includes:

* [index.js](./index.js): needs to require the new TransportXYZ
* [package.json/dependencies](./package.json#L13): should specify which version range of the transport to include
* [API.md](./API.md): has overview documentation
* [Transports.js](./Transports.js#L78): add a function like `http()`, `gun()` etc. that allows finding loaded transports (this can, for example, be used by one transport to find another)
* [Transports.js/p_connect](./Transports.js#L625): add to the list so the transport connects by default at startup
* [dweb-archive/Util.config](https://github.com/internetarchive/dweb-archive/blob/master/Util.js#L135)
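The per-transport accessor mentioned above can be sketched as follows; the `_transports` array and object shapes here are stand-ins for the real internals, not the exact dweb-transports implementation:

```javascript
// Hedged sketch of an accessor like http() or gun(): it lets code
// (including another transport) find the loaded TransportXYZ instance.
// Transports._transports stands in for however loaded transports are
// actually tracked.
class Transports {
  static xyz() {
    // Return the loaded XYZ transport, or undefined if it isn't connected
    return Transports._transports.find(t => t.name === "XYZ");
  }
}
Transports._transports = [{ name: "HTTP" }, { name: "XYZ" }];
```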
#### Partial implementation

It's perfectly legitimate to implement only the parts of the API that the underlying platform supports,
though the library will work better if the others are implemented as well.
For example:
* a list can be implemented on top of a key-value system by adding each new item under a key that is a timestamp.
* a key-value store can be implemented on top of lists by appending a {key: value} data structure and filtering on retrieval.
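The first bullet can be sketched as an in-memory stand-in (illustrative only, not actual transport code): each append stores the item under a timestamp-derived key, and retrieval sorts the keys to recover insertion order.

```javascript
// Illustrative sketch: an append-only list emulated on a key-value table by
// using a zero-padded timestamp (plus a sequence number to break ties) as
// the key, so that sorting the keys replays items in insertion order.
class ListOverKV {
  constructor() {
    this.table = new Map(); // stand-in for the platform's key-value table
    this.seq = 0;           // tie-breaker for appends in the same millisecond
  }
  append(item) {
    const key = String(Date.now()).padStart(15, "0")
      + "-" + String(this.seq++).padStart(6, "0");
    this.table.set(key, item);
  }
  items() {
    // Lexicographic sort of the padded keys matches append order
    return [...this.table.keys()].sort().map(k => this.table.get(k));
  }
}
```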

**monitor** and **listmonitor** will only work if the underlying system supports them, and it's perfectly reasonable not to implement them.
They aren't currently used by the dweb-archive / dweb.archive.org code.

Make sure that the `TransportXYZ.js` `constructor()` correctly lists the functions it implements in the `.supportFunctions` field.
This field is used by Transports.js to see which transports to try for which functionality.

For example, if "store" is listed in TransportXYZ.supportFunctions,
then a call to Transports.p_rawstore() will attempt to store using XYZ,
and will add whatever URL `TransportXYZ.p_rawstore()` returns to the array of URLs where the content is stored.
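A hedged sketch of that constructor convention (the field names follow the description above; the real base class and exact fields live in dweb-transports, and `supports()` is a hypothetical helper added only to illustrate the lookup):

```javascript
// Sketch of a constructor declaring what the transport implements.
// supportFunctions is the field Transports.js consults to decide which
// transports to try for which functionality.
class TransportXYZ {
  constructor(options = {}) {
    this.name = "XYZ";          // short name used in logs and lookups
    this.supportURLs = ["xyz"]; // URL schemes this transport handles
    // List only what the platform really implements - e.g. omit
    // "monitor"/"listmonitor" if the platform cannot support them.
    this.supportFunctions = ["fetch", "store", "add", "list"];
    this.options = options;
  }
  supports(func) {
    return this.supportFunctions.includes(func);
  }
}
```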

## Integration into the Archive's servers

Integration with the Archive's content will definitely require a more in-depth collaboration,
but below is an outline.

The key challenge is that the Archive has about 50 petabytes of data,
and none of the distributed platforms can practically handle that currently.
So we use 'lazy-seeding' techniques to push/pull data into a platform as it is requested by users.
Optionally, if the process of adding a (possibly large) item is slow (e.g. in IPFS, WEBTORRENT), we can also crawl some subset of Archive resources and pre-seed those files to the platform.

In all cases, we presume that we run a (potentially modified) peer at the Archive,
so that interaction between the Archive servers and the system is fast and bandwidth is essentially free.
We call this peer a "superpeer".

In case it's useful, our servers have:
* A persistent volume available to each peer, e.g. at /pv/gun
* An implementation of Redis answering on 0.0.0.0:6379 which saves to the persistent volume
* An HTTPS or WSS proxy (we prefer this over giving the superpeer access to dweb.me's certificate)
* Log files (including rotation)
* cron (not currently used, but can be)

These are available to superpeers but will require some liaison so we know how they are being used.
### Conventions

Please follow these conventions, i.e.
* Code location: `/usr/local/<repo-name>`, e.g. `/usr/local/dweb-transports`
* Persistent volume: `/pv/<transportname>`, e.g. `/pv/gun`
* Log files: `/var/log/dweb/dweb-<transportname>`, e.g. `/var/log/dweb/dweb-gun`
### Options for integration: Hijack, Push, Hybrid

The actual choices to be made will depend on some of the differences between transports, specifically:
* Is data immutable and referred to by a content address or hash (IPFS, WEBTORRENT), or is it mutable and referred to by a name (GUN, YJS, FLUENCE)?
* Will it be easier to:
  1. 'Hijack' specific addresses and use the peer to initiate retrieval from our servers (GUN)
  2. Have the server push data into the platform and share the hash generated by the platform in the metadata (IPFS), and/or pass a URL to the platform which it can pull and return its hash
  3. Hybrid - precalculate content addresses during item creation, then hijack the request for the data; this is expensive for the Archive so is going to take a lot longer to set up (WEBTORRENT)

Each of these requires a different technique; the documentation below currently only covers metadata access for material addressed by name.

#### 1. Hijacking

For hijacking, currently used by GUN, the peer implements in its code
a way to map from a specific address to an action, the simplest being a URL access.

We think that hijacking is a generically useful function
that allows a decentralized system to coexist with legacy (centralized) data
and to cache and share that data in a decentralized fashion before the platform has the ability to absorb all the data.

Obviously this could run quite complex functionality, but in many cases a simple mapping to URLs on our gateway will work well.

See [dweb-transport/gun/gun_https_hijackable.js](https://github.com/internetarchive/dweb-transport/blob/master/gun/gun_https_hijackable.js) for the code modification
and [gun_https_archive.js](https://github.com/internetarchive/dweb-transport/blob/master/gun/gun_https_archive.js) for the configuration that maps `/arc/archive/metadata` to `https://dweb.me/arc/archive.org/metadata/` so that, for example,
`gun:/arc/archive/metadata/commute` retrieves metadata for the `commute` Internet Archive item at [https://dweb.me/arc/archive.org/metadata/commute](https://dweb.me/arc/archive.org/metadata/commute).
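The mapping itself can be as simple as a prefix rewrite. A sketch of that idea (the function name and mapping table are illustrative; the real configuration lives in gun_https_archive.js):

```javascript
// Illustrative prefix-to-gateway-URL rewrite, mirroring the mapping
// described above; hijackedUrl() is a hypothetical name.
const hijackMap = {
  "/arc/archive/metadata": "https://dweb.me/arc/archive.org/metadata",
};

function hijackedUrl(platformAddress) {
  for (const [prefix, gatewayBase] of Object.entries(hijackMap)) {
    if (platformAddress.startsWith(prefix)) {
      // Rewrite the hijacked prefix, keeping the rest of the path
      return gatewayBase + platformAddress.slice(prefix.length);
    }
  }
  return null; // not a hijacked address - handle it normally
}
```

For example, `hijackedUrl("/arc/archive/metadata/commute")` yields `https://dweb.me/arc/archive.org/metadata/commute`.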

This will also work if the address of the table is a hash, for example `xyz:/xyz/Q1234567/commute`,
where `Q1234567` would be `xyz`'s address for the metadata table.
The mapping to that table's address can be hard-coded, or included in the dweb-transports/Naming.js resolution.

The dweb-archive code needs to know to try GUN for the metadata, and this is configured in [Naming.js](./Naming.js).
Note that this configuration mechanism is likely to change in the future, though the address (on GUN) checked should remain the same.

File retrieval can work similarly if the platform allows addressing by name.
For example, gun:/arc/archive/download could be mapped to https://dweb.me/arc/archive.org/download so that gun:/arc/archive/download/commute/commute.avi
would resolve. Similarly, the mapping could be to an opaque hash-based address like `xyz:/xyz/Q99999/commute/commute.avi`.
In this case the Archive client would be configured to automatically add a transformed URL like this as one of the places to look for a file.

#### 2. Push of URL mapping (preferred) or content

This is more complex, and can only integrate file access, not metadata.

The general path is that a client requests metadata (via HTTP or GUN currently);
the dweb-gateway server then passes a URL to the platform (IPFS), which retrieves the URL,
calculates its hash (which is a hash of the internal data structure (IPLD)) and passes
that back to the server. The server incorporates it into the metadata returned.

This is less preferable than hijacking, in part because the first metadata query is
delayed while the platform retrieves and processes a potentially large file in order to
generate its internal address for it.
It is, however, likely to be necessary if the platform uses content addressing,
especially if it uses an internally generated address (for example IPFS uses a multihash of an internal 'IPLD' object).

This approach is used for IPFS.

We will need an HTTP API, and a snippet of code (currently only Python is supported) that we can integrate.

It should have a signature like:
```
def store(self, data=None,     # If passed, this data will be pushed
          urlfrom=None,        # The URL at which the superpeer can access the data; note this URL may not be accessible to other peers
          verbose=False,       # Generate debugging info
          mimetype=None,       # Can be passed to the superpeer if required by its HTTP API
          pinggateway=True,    # On some platforms (IPFS) we can optionally ping a (central) address to encourage propagation
          **options):          # Catch-all for other future options
```
and it should return a string that is the URL to be used for access, e.g. `ipfs:/ipfs/Q12345`.

We'll need to integrate it into `dweb-gateway` in [Archive.py/item2thumbnail()](https://github.com/internetarchive/dweb-objects/blob/master/Archive.py#L360)
and [NameResolver.py/cache_content()](https://github.com/internetarchive/dweb-objects/blob/master/NameResolver.py#L222).

#### 3. Hybrid - Precalculate + hijack

For WebTorrent we have done a much more complex process which we don't want to repeat if possible,
at least until some platform is already operating at scale.
However there may be some hints in its structure at options for superpeers.

It involves:

* The torrent magnet links are calculated as we add items to the Archive, and have been batch-run on the entire archive (expensive!) and indexed.
* The torrents include pointers to a superpeer Tracker.
* Those links are added into the metadata in `ArchiveItem.new()` and `ArchiveFile.new()`.
* The superpeer Tracker pretends that any magnet link is available at the Seeder.
* The Seeder accesses a specific URL like btih/12345.
* The gateway looks up the BTIH and returns a torrent file.
* The Seeder uses the torrent file to fetch and return the required data.
### Installation for testing

To make this work we'll need:
* Pull requests on dweb-objects and dweb-transports.
* Access to a repo (or branch) for the platform that has the hijacking code; this can be
either a separate repo or a pull request on dweb-transport where you can take over a directory (GUN does this).
### Installation for production integration

We'll then need some info to help us integrate the transport into our Docker/Kubernetes production system.
Sorry, but this isn't currently in an open repo since it's tied into our CI system. The content will include:

* Any one-time instructions to run in `superv`.
Note these are run each time a dockerfile starts, so they need to be safe to run multiple times, e.g.
```
# gun setup
mkdir -p -m777 /pv/gun
```
* Any ports that need exposing or mapping, to go in `chart/templates/template.yaml`, e.g.
```
- name: wsgun
  containerPort: 4246
```
and in `chart/templates/service.yaml`
```
- port: 4246
  targetPort: 4246
  name: wsgun
  protocol: TCP
```
and in `ports-unblock.sh`
```
proto tcp dport 4246 ACCEPT; # GUN websockets port
```
* Startup info to go in `supervisor.conf`, e.g.
```
[program:dweb-gun]
command=node /usr/local/dweb-transport/gun/gun_https_archive.js 4246
directory = /pv/gun
stdout_logfile = /var/log/dweb/dweb-gun
stdout_logfile_maxbytes=500mb
redirect_stderr = True
autostart = True
autorestart = True
environment=GUN_ENV=false
exitcodes=0
```
* Docker setup.
You can presume that the Docker image has the following before your install;
this should be pretty standard for NodeJS, Python3 or Go applications, and you can add other packages you need.
```
FROM ubuntu:rolling
RUN apt-get -y update && apt-get -y install redis-server supervisor zsh git python3-pip curl sudo nginx python3-nacl golang nodejs npm cron
COPY . /app/
COPY etc /etc/
RUN mkdir -p /var/log/dweb
```
Typically your code for integrating into Docker would then look something like the following NodeJS example (Go and Python3 examples on request):
```
RUN apt-get -y install anyOtherPackagesYouNeed
RUN cd /usr/local && git clone https://github.com/<yourRepo || internetarchive/dweb-transport> \
    && cd /usr/local/<yourRepo> && npm install \
    && ln -s /pv/xyz /usr/local/<yourRepo>/someinternaldirectory
# Setup any cron call if required
RUN echo '3 * * * * root node /usr/local/yourRepo/cron_hourly.js' > /etc/cron.d/dweb
ENV XYZABC="some environment info you need"
# Expose any defaults you need
EXPOSE 1234
```
|