# Pose Detection in the Browser: PoseNet Model

## Note: We've just released Version 2.0 with a **new ResNet** model and API. Check out the new documentation below.

This package contains a standalone model called PoseNet, as well as some demos, for running real-time pose estimation in the browser using TensorFlow.js.

[Try the demo here!](https://storage.googleapis.com/tfjs-models/demos/posenet/camera.html)

<img src="demos/camera.gif" alt="cameraDemo" style="width: 600px;"/>

PoseNet can be used to estimate either a single pose or multiple poses, meaning there is a version of the algorithm that can detect only one person in an image/video and a version that can detect multiple persons in an image/video.

[Refer to this blog post](https://medium.com/tensorflow/real-time-human-pose-estimation-in-the-browser-with-tensorflow-js-7dd0bc881cd5) for a high-level description of PoseNet running on TensorFlow.js.

To keep track of issues we use the [tensorflow/tfjs](https://github.com/tensorflow/tfjs) GitHub repo.

## Documentation Note

> The README you see here is for the [PoseNet 2.0 version](https://www.npmjs.com/package/@tensorflow-models/posenet). For the README of the previous 1.0 version, please see the [README published on NPM](https://www.npmjs.com/package/@tensorflow-models/posenet/v/1.0.3).

## Installation

You can use this as a standalone ES5 bundle like this:

```html
  <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
  <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/posenet"></script>
```

Or you can install it via npm for use in a TypeScript / ES6 project.

```sh
npm install @tensorflow-models/posenet
```
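
After installing via npm, import the package in your application code. The snippet below is a minimal sketch that assumes your project is compiled with a module bundler (e.g. webpack or Parcel) that resolves npm packages:

```javascript
// Minimal sketch: import PoseNet from the npm package and load the default model.
import * as posenet from '@tensorflow-models/posenet';

async function loadPoseNet() {
  // Loads the default MobileNetV1 model; see "Config params in posenet.load()" below.
  const net = await posenet.load();
  return net;
}
```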

## Usage

Either a single pose or multiple poses can be estimated from an image.
Each methodology has its own algorithm and set of parameters.

### Loading a pre-trained PoseNet Model

In the first step of pose estimation, an image is fed through a pre-trained model. PoseNet **comes with a few different versions of the model,** corresponding to variants of the MobileNetV1 architecture and the ResNet50 architecture. To get started, a model must be loaded from a checkpoint:

```javascript
const net = await posenet.load();
```

By default, `posenet.load()` loads a faster and smaller model based on the MobileNetV1 architecture, at the cost of lower accuracy. If you want to load the larger and more accurate model, specify the architecture explicitly in `posenet.load()` using a `ModelConfig` dictionary:

#### MobileNet (smaller, faster, less accurate)
```javascript
const net = await posenet.load({
  architecture: 'MobileNetV1',
  outputStride: 16,
  inputResolution: { width: 640, height: 480 },
  multiplier: 0.75
});
```

#### ResNet (larger, slower, more accurate) **new!**
```javascript
const net = await posenet.load({
  architecture: 'ResNet50',
  outputStride: 32,
  inputResolution: { width: 257, height: 200 },
  quantBytes: 2
});
```

#### Config params in posenet.load()

* **architecture** - Can be either `MobileNetV1` or `ResNet50`. It determines which PoseNet architecture to load.

* **outputStride** - Can be one of `8`, `16`, `32` (strides `16` and `32` are supported for the ResNet architecture; strides `8`, `16`, and `32` are supported for the MobileNetV1 architecture). It specifies the output stride of the PoseNet model. The smaller the value, the larger the output resolution and the more accurate the model, at the cost of speed. Set this to a larger value to increase speed at the cost of accuracy.

* **inputResolution** - A `number` or an `Object` of type `{width: number, height: number}`. Defaults to `257`. It specifies the size the image is resized and padded to before it is fed into the PoseNet model. The larger the value, the more accurate the model, at the cost of speed. Set this to a smaller value to increase speed at the cost of accuracy. If a number is provided, the image will be resized and padded to be a square with the same width and height. If `width` and `height` are provided, the image will be resized and padded to the specified width and height.

* **multiplier** - Can be one of `1.01`, `1.0`, `0.75`, or `0.50` (the value is used *only* by the MobileNetV1 architecture, not by the ResNet architecture). It is the float multiplier for the depth (number of channels) of all convolution ops. The larger the value, the larger the size of the layers and the more accurate the model, at the cost of speed. Set this to a smaller value to increase speed at the cost of accuracy.

* **quantBytes** - This argument controls the number of bytes used for weight quantization. The available options are:

  - `4`. 4 bytes per float (no quantization). Leads to the highest accuracy and the original model size (~90MB).
  - `2`. 2 bytes per float. Leads to slightly lower accuracy and a 2x reduction in model size (~45MB).
  - `1`. 1 byte per float. Leads to lower accuracy and a 4x reduction in model size (~22MB).

* **modelUrl** - An optional string that specifies a custom URL for the model. This is useful for local development or for countries that don't have access to the model hosted on GCP.

**By default,** PoseNet loads a MobileNetV1 architecture with a **`0.75`** multiplier. This is recommended for computers with **mid-range/lower-end GPUs.** A model with a **`0.50`** multiplier is recommended for **mobile.** The ResNet architecture is recommended for computers with **even more powerful GPUs**.
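
For example, a configuration aimed at mobile devices might combine the `0.50` multiplier with 1-byte quantization to reduce both compute and download size. The following is a minimal sketch; the specific values are illustrative and the right trade-off depends on your target device:

```javascript
// Sketch: a smaller, faster MobileNetV1 configuration for mobile devices.
const mobileNet = await posenet.load({
  architecture: 'MobileNetV1',
  outputStride: 16,
  inputResolution: { width: 257, height: 257 },
  multiplier: 0.50,
  quantBytes: 1 // 1-byte quantization reduces model size at some accuracy cost
});
```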

### Single-Person Pose Estimation

Single-pose estimation is the simpler and faster of the two algorithms. Its ideal use case is when there is only one person in the image. The disadvantage is that if there are multiple persons in an image, keypoints from different persons will likely be estimated as being part of the same single pose; for example, person #1’s left arm and person #2’s right knee might be conflated by the algorithm as belonging to the same pose. Both the MobileNetV1 and the ResNet architectures support single-person pose estimation. The method returns a **single pose**:

```javascript
const net = await posenet.load();

const pose = await net.estimateSinglePose(image, {
  flipHorizontal: false
});
```

#### Params in estimateSinglePose()

* **image** - ImageData|HTMLImageElement|HTMLCanvasElement|HTMLVideoElement
  The input image to feed through the network.
* **inferenceConfig** - an object containing:
  * **flipHorizontal** - Defaults to false. Whether the pose should be flipped/mirrored horizontally. Set this to true for videos that are flipped horizontally by default (i.e. a webcam) when you want the pose to be returned in the proper orientation (see the webcam sketch after this list).
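
For instance, when estimating a pose from a mirrored webcam feed you would typically pass `flipHorizontal: true`. The snippet below is a rough sketch; it assumes a `<video>` element with id `webcam` whose stream has already been started:

```javascript
// Sketch: single-pose estimation on a mirrored webcam feed.
// Assumes `video` is an HTMLVideoElement already playing a webcam stream.
const video = document.getElementById('webcam');

const net = await posenet.load();
const pose = await net.estimateSinglePose(video, {
  flipHorizontal: true // webcam feeds are usually mirrored, so flip the returned keypoints
});
console.log(pose);
```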

#### Returns

It returns a `Promise` that resolves with a **single** `pose`. The `pose` has a confidence score and an array of keypoints indexed by part id, each with a score and position.

#### Example Usage

##### via Script Tag

```html
<html>
  <head>
    <!-- Load TensorFlow.js -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
    <!-- Load Posenet -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/posenet"></script>
  </head>

  <body>
    <img id='cat' src='/images/cat.jpg'/>
  </body>
  <!-- Place your code in the script tag below. You can also use an external .js file -->
  <script>
    var flipHorizontal = false;

    var imageElement = document.getElementById('cat');

    posenet.load().then(function(net) {
      const pose = net.estimateSinglePose(imageElement, {
        flipHorizontal: flipHorizontal
      });
      return pose;
    }).then(function(pose) {
      console.log(pose);
    });
  </script>
</html>
```

##### via NPM

```javascript
import * as posenet from '@tensorflow-models/posenet';

async function estimatePoseOnImage(imageElement) {
  // load the posenet model from a checkpoint
  const net = await posenet.load();

  const pose = await net.estimateSinglePose(imageElement, {
    flipHorizontal: false
  });
  return pose;
}

const imageElement = document.getElementById('cat');

// estimatePoseOnImage is async, so wait for the promise before logging the result
estimatePoseOnImage(imageElement).then(function(pose) {
  console.log(pose);
});
```

which would produce the output:

```json
{
  "score": 0.32371445304906,
  "keypoints": [
    {
      "position": {
        "y": 76.291801452637,
        "x": 253.36747741699
      },
      "part": "nose",
      "score": 0.99539834260941
    },
    {
      "position": {
        "y": 71.10383605957,
        "x": 253.54365539551
      },
      "part": "leftEye",
      "score": 0.98781454563141
    },
    {
      "position": {
        "y": 71.839515686035,
        "x": 246.00454711914
      },
      "part": "rightEye",
      "score": 0.99528175592422
    },
    {
      "position": {
        "y": 72.848854064941,
        "x": 263.08151245117
      },
      "part": "leftEar",
      "score": 0.84029853343964
    },
    {
      "position": {
        "y": 79.956565856934,
        "x": 234.26812744141
      },
      "part": "rightEar",
      "score": 0.92544466257095
    },
    {
      "position": {
        "y": 98.34538269043,
        "x": 399.64068603516
      },
      "part": "leftShoulder",
      "score": 0.99559044837952
    },
    {
      "position": {
        "y": 95.082359313965,
        "x": 458.21868896484
      },
      "part": "rightShoulder",
      "score": 0.99583911895752
    },
    {
      "position": {
        "y": 94.626205444336,
        "x": 163.94561767578
      },
      "part": "leftElbow",
      "score": 0.9518963098526
    },
    {
      "position": {
        "y": 150.2349395752,
        "x": 245.06030273438
      },
      "part": "rightElbow",
      "score": 0.98052614927292
    },
    {
      "position": {
        "y": 113.9603729248,
        "x": 393.19735717773
      },
      "part": "leftWrist",
      "score": 0.94009721279144
    },
    {
      "position": {
        "y": 186.47859191895,
        "x": 257.98034667969
      },
      "part": "rightWrist",
      "score": 0.98029226064682
    },
    {
      "position": {
        "y": 208.5266418457,
        "x": 284.46710205078
      },
      "part": "leftHip",
      "score": 0.97870296239853
    },
    {
      "position": {
        "y": 209.9910736084,
        "x": 243.31219482422
      },
      "part": "rightHip",
      "score": 0.97424703836441
    },
    {
      "position": {
        "y": 281.61965942383,
        "x": 310.93188476562
      },
      "part": "leftKnee",
      "score": 0.98368924856186
    },
    {
      "position": {
        "y": 282.80120849609,
        "x": 203.81164550781
      },
      "part": "rightKnee",
      "score": 0.96947449445724
    },
    {
      "position": {
        "y": 360.62716674805,
        "x": 292.21047973633
      },
      "part": "leftAnkle",
      "score": 0.8883239030838
    },
    {
      "position": {
        "y": 347.41177368164,
        "x": 203.88229370117
      },
      "part": "rightAnkle",
      "score": 0.8255187869072
    }
  ]
}
```

### Keypoints

All keypoints are indexed by part id. The parts and their ids are:

| Id | Part |
| -- | -- |
| 0 | nose |
| 1 | leftEye |
| 2 | rightEye |
| 3 | leftEar |
| 4 | rightEar |
| 5 | leftShoulder |
| 6 | rightShoulder |
| 7 | leftElbow |
| 8 | rightElbow |
| 9 | leftWrist |
| 10 | rightWrist |
| 11 | leftHip |
| 12 | rightHip |
| 13 | leftKnee |
| 14 | rightKnee |
| 15 | leftAnkle |
| 16 | rightAnkle |

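
As an illustration of working with this structure (not part of the PoseNet API), the sketch below keeps only the keypoints of a pose whose score clears a chosen confidence threshold:

```javascript
// Sketch: filter a pose's keypoints by confidence score.
// `pose` is the object returned by estimateSinglePose(); 0.5 is an arbitrary threshold.
const minPartConfidence = 0.5;

const confidentKeypoints = pose.keypoints.filter(
  keypoint => keypoint.score >= minPartConfidence
);

confidentKeypoints.forEach(({ part, position, score }) => {
  console.log(`${part}: (${position.x.toFixed(1)}, ${position.y.toFixed(1)}), score ${score.toFixed(2)}`);
});
```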

### Multi-Person Pose Estimation

Multi-person pose estimation can decode multiple poses in an image. It is more complex and slightly slower than the single-person algorithm, but it has the advantage that if multiple people appear in an image, their detected keypoints are less likely to be associated with the wrong pose. Even if the use case is to detect a single person’s pose, this algorithm may be more desirable, because the accidental effect of two poses being joined together won’t occur when multiple people appear in the image. It uses the `Fast greedy decoding` algorithm from the research paper [PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model](https://arxiv.org/pdf/1803.08225.pdf). Both the MobileNetV1 and the ResNet architectures support multi-person pose estimation. The method returns a **promise** that resolves with an **array of poses**:

```javascript
const net = await posenet.load();

const poses = await net.estimateMultiplePoses(image, {
  flipHorizontal: false,
  maxDetections: 5,
  scoreThreshold: 0.5,
  nmsRadius: 20
});
```

#### Params in estimateMultiplePoses()

* **image** - ImageData|HTMLImageElement|HTMLCanvasElement|HTMLVideoElement
  The input image to feed through the network.
* **inferenceConfig** - an object containing:
  * **flipHorizontal** - Defaults to false. Whether the poses should be flipped/mirrored horizontally. Set this to true for videos that are flipped horizontally by default (i.e. a webcam) when you want the poses to be returned in the proper orientation.
  * **maxDetections** - the maximum number of poses to detect. Defaults to 5.
  * **scoreThreshold** - Only return instance detections whose root part score is greater than or equal to this value. Defaults to 0.5.
  * **nmsRadius** - Non-maximum suppression part distance. It needs to be strictly positive. Two parts suppress each other if they are less than `nmsRadius` pixels apart. Defaults to 20.

#### Returns

It returns a `Promise` that resolves with an array of `pose`s, each with a confidence score and an array of `keypoints` indexed by part id, each with a score and position.

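As a hypothetical usage sketch (not part of the API), you might discard low-confidence poses before drawing or processing them:

```javascript
// Sketch: keep only poses whose overall score clears a threshold.
// `poses` is the array returned by estimateMultiplePoses(); 0.15 is an arbitrary cutoff.
const minPoseConfidence = 0.15;

const confidentPoses = poses.filter(pose => pose.score >= minPoseConfidence);

confidentPoses.forEach((pose, i) => {
  console.log(`pose ${i}: score ${pose.score.toFixed(2)}, ${pose.keypoints.length} keypoints`);
});
```
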
##### via Script Tag

```html
<html>
  <head>
    <!-- Load TensorFlow.js -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
    <!-- Load Posenet -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/posenet"></script>
  </head>

  <body>
    <img id='cat' src='/images/cat.jpg'/>
  </body>
  <!-- Place your code in the script tag below. You can also use an external .js file -->
  <script>
    var imageElement = document.getElementById('cat');

    posenet.load().then(function(net) {
      return net.estimateMultiplePoses(imageElement, {
        flipHorizontal: false,
        maxDetections: 2,
        scoreThreshold: 0.6,
        nmsRadius: 20
      });
    }).then(function(poses) {
      console.log(poses);
    });
  </script>
</html>
```

##### via NPM

```javascript
import * as posenet from '@tensorflow-models/posenet';

async function estimateMultiplePosesOnImage(imageElement) {
  const net = await posenet.load();

  // estimate poses
  const poses = await net.estimateMultiplePoses(imageElement, {
    flipHorizontal: false,
    maxDetections: 2,
    scoreThreshold: 0.6,
    nmsRadius: 20
  });

  return poses;
}

const imageElement = document.getElementById('people');

// estimateMultiplePosesOnImage is async, so wait for the promise before logging the result
estimateMultiplePosesOnImage(imageElement).then(function(poses) {
  console.log(poses);
});
```

This produces the output:

```
[
  // pose 1
  {
    // pose score
    "score": 0.42985695206067,
    "keypoints": [
      {
        "position": {
          "x": 126.09371757507,
          "y": 97.861720561981
        },
        "part": "nose",
        "score": 0.99710708856583
      },
      {
        "position": {
          "x": 132.53466176987,
          "y": 86.429876804352
        },
        "part": "leftEye",
        "score": 0.99919074773788
      },
      {
        "position": {
          "x": 100.85626316071,
          "y": 84.421931743622
        },
        "part": "rightEye",
        "score": 0.99851280450821
      },

      ...

      {
        "position": {
          "x": 72.665352582932,
          "y": 493.34189963341
        },
        "part": "rightAnkle",
        "score": 0.0028593824245036
      }
    ]
  },
  // pose 2
  {
    // pose score
    "score": 0.13461434583673,
    "keypoints": [
      {
        "position": {
          "x": 116.58444058895,
          "y": 99.772533416748
        },
        "part": "nose",
        "score": 0.0028593824245036
      },
      {
        "position": {
          "x": 133.49897611141,
          "y": 79.644590377808
        },
        "part": "leftEye",
        "score": 0.99919074773788
      },
      {
        "position": {
          "x": 100.85626316071,
          "y": 84.421931743622
        },
        "part": "rightEye",
        "score": 0.99851280450821
      },

      ...

      {
        "position": {
          "x": 72.665352582932,
          "y": 493.34189963341
        },
        "part": "rightAnkle",
        "score": 0.0028593824245036
      }
    ]
  },
  // pose 3
  {
    // pose score
    "score": 0.13461434583673,
    "keypoints": [
      {
        "position": {
          "x": 116.58444058895,
          "y": 99.772533416748
        },
        "part": "nose",
        "score": 0.0028593824245036
      },
      {
        "position": {
          "x": 133.49897611141,
          "y": 79.644590377808
        },
        "part": "leftEye",
        "score": 0.99919074773788
      },

      ...

      {
        "position": {
          "x": 59.334579706192,
          "y": 485.5936152935
        },
        "part": "rightAnkle",
        "score": 0.004110524430871
      }
    ]
  }
]
```

## Developing the Demos

Details for how to run the demos are included in the `demos/` folder.