Falsehoods programmers believe about [video stuff]
by Niklas Haas on December 25, 2016
Inspired by numerous other such lists of falsehoods. Pretty much every video player in existence gets a good chunk, if not the vast majority, of these wrong. (Some of these also, or even mostly, apply to users.)
Falsehoods programmers believe about..
.. video decoding
- decoding is bit-exact, so the decoder used does not affect the quality
- since H.264 decoding is bit-exact, the decoder used does not affect the quality [1]
- hardware decoding means I don’t have to worry about performance
- hardware decoding is always faster than software decoding
- an H.264 hardware decoder can decode all H.264 files (see the sketch after this list)
- an H.264 software decoder can decode all H.264 files
- video decoding is easily parallelizable
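To make the hardware decoder falsehood concrete: real hardware decoders advertise capability limits (maximum profile, level, and surface size), and files outside those limits need a software fallback. The sketch below is purely illustrative; the struct and the capability values are hypothetical, not any real driver API. 10-bit H.264 (Hi10P) is the classic case that virtually no consumer hardware can decode.

```c
/* Hypothetical capability check: why "an H.264 hardware decoder can
 * decode all H.264 files" fails in practice. */
#include <stdbool.h>
#include <stdio.h>

struct hwdec_caps {
    int max_profile;          /* e.g. 100 = High; Hi10P is 110 */
    int max_level;            /* e.g. 51 = level 5.1 */
    int max_width, max_height;
};

static bool hwdec_can_decode(const struct hwdec_caps *caps,
                             int profile, int level, int w, int h)
{
    return profile <= caps->max_profile &&
           level   <= caps->max_level &&
           w <= caps->max_width && h <= caps->max_height;
}

int main(void)
{
    /* made-up caps for a typical consumer GPU: High profile, level 5.1 */
    struct hwdec_caps gpu = { 100, 51, 4096, 2304 };

    /* a Hi10P 1080p encode (profile 110) exceeds the profile cap */
    printf("Hi10P 1080p: %s\n",
           hwdec_can_decode(&gpu, 110, 41, 1920, 1080) ? "hw" : "sw fallback");
    return 0;
}
```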
.. video playback
- the display’s refresh rate will be an integer multiple of the video file’s frame rate (see the sketch after this list)
- the display’s clock will be in sync with the audio clock
- I can accurately measure the display’s clock
- I can accurately measure the audio clock
- I can exclusively use the audio clock for timing
- I can exclusively use the video clock for timing
- my hardware contexts will survive the user’s coffee break
- my hardware contexts will never disappear in the middle of playback
- I can always request a new hardware context after my previous one disappeared
- it’s okay to error and quit if I can’t request a hardware context
- hardware decoding and video playback will happen on the same device
- transferring frames from one device to another is easy
- the user will not notice 3:2 pulldown
- the user will not notice the odd dropped or duplicated frame
- all video frames will be unique
- all video frames will be decoded in order
- all video sources can be seeked in
- the user will never want to seek to non-keyframes
- seeking to a position will produce the same output as decoding to a position
- I can seek to a specific frame number
- videos have a fixed frame rate
- all frame timestamps are precise
- all frame timestamps are precise in modern formats like .mkv (see the sketch after this list)
- all frame timestamps are monotonically increasing
- all frame timestamps are monotonically increasing as long as you don’t seek
- all frame timestamps are unique
- the duration of the final video frame is always known
- users will not notice if I skip the final video frame
- users will never want to play videos in reverse
- users will not notice if I skip a video frame when pausing
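On the refresh rate bullet: a 24000/1001 fps file on a 60 Hz display needs about 2.5 vsyncs per frame, so frames have to alternate between being shown for 3 and for 2 refreshes, which is exactly the 3:2 pulldown cadence mentioned above. A toy computation (not any player's actual scheduling code):

```c
/* Snap each frame's ideal presentation time to the nearest vsync slot
 * and print how many refreshes each frame ends up occupying. */
#include <stdio.h>

int main(void)
{
    const double fps = 24000.0 / 1001.0;     /* "23.976" NTSC film rate */
    const double hz  = 60.0;
    const double slots_per_frame = hz / fps; /* ~2.5025 vsyncs per frame */

    for (int n = 0; n < 8; n++) {
        int s0 = (int)(n * slots_per_frame + 0.5);
        int s1 = (int)((n + 1) * slots_per_frame + 0.5);
        printf("frame %d shown for %d vsyncs\n", n, s1 - s0);
    }
    return 0;
}
```

This prints the 3, 2, 3, 2, … pattern; whether users notice it is a separate falsehood.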
.. video/image files
- all video files have 8-bit per channel color
- all video files have 8-bit or 10-bit per channel color
- fine, but at least all channels are going to have the same number of bits
- all samples are going to fit into a 32-bit integer
- every pixel consists of three samples
- every pixel consists of three or four samples
- fine, every pixel consists of n samples
- all image files are sRGB
- all video files are BT.601 or BT.709
- all image files are either sRGB or contain an ICC profile
- 4:2:0 is the only way to subsample images
- all image files contain correct tags indicating their color space
- interlaced video files no longer exist
- I can detect whether a file is interlaced or not
- the chroma location is the same for every YCbCr file
- all HD videos are BT.709
- video files will have the same frame rate throughout the stream
- video files will have the same resolution throughout the stream
- video files will have the same color space throughout the stream
- video files will have the same pixel format throughout the stream
- fine, videos will have the same video codec throughout the stream
- the video and audio tracks will start at the same time
- the video and audio tracks will both be present throughout the stream
- I can start playing an audio file at the first decoded sample, and stop playing it at the last
- virtual timelines can be implemented on the demuxer level
- adjacent frames will have similar durations
- all multimedia formats have easily identifiable headers
- a file will never be a legal JPEG and MP3 at the same time
- applying heuristics to guess the right filetype is easy
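As an example of imprecise timestamps in a modern format: Matroska stores timestamps as integers on a timescale that defaults to 1 ms, so the exact 24000/1001 fps frame times get rounded, and the stored frame durations jitter between 41 ms and 42 ms instead of the true 41.708 ms:

```c
/* Round exact 23.976 fps frame times to a 1 ms timescale (Matroska's
 * default) and print the resulting per-frame durations. */
#include <stdio.h>

int main(void)
{
    const double frame_ms = 1001.0 / 24000.0 * 1000.0; /* ~41.7083 ms */

    long prev = 0;
    for (int n = 1; n <= 6; n++) {
        long ts = (long)(n * frame_ms + 0.5); /* rounded to 1 ms */
        printf("frame %d: pts=%ld ms, duration=%ld ms\n", n - 1, ts, ts - prev);
        prev = ts;
    }
    return 0;
}
```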
.. image scaling
- the GPU’s built-in bilinear scaling is sufficient for everybody
- bicubic scaling is sufficient for everybody
- the image can just be scaled in its native color space
- I should linearize before scaling
- I shouldn’t linearize before scaling (see the sketch after this list)
- upscaling is the same as downscaling
- the quality of scaling algorithms can be objectively measured
- the slower a scaling algorithm is to compute, the better it will be
- upscaling algorithms can invent information that doesn’t exist in the image
- my scaling ratio is going to be the same in the x axis and the y axis
- chroma upscaling isn’t as important as luma upscaling
- chroma and luma can/should be scaled separately
- I can ignore sub-pixel offsets when scaling and aligning planes
- I should always take sub-pixel offsets into account when scaling
- images contain no information above the Nyquist frequency
- images contain no information outside the TV signal range
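To illustrate the linearization dilemma: every scaler ultimately averages neighboring samples, and averaging gamma-encoded sRGB values gives a visibly different result from averaging in linear light, which is why the two contradictory bullets above can both be falsehoods. (Roughly: downscaling tends to favor linear light for physically plausible results, while linear-light upscaling can exaggerate ringing.) A minimal demonstration using the sRGB transfer function from IEC 61966-2-1:

```c
/* Average a black and a white sRGB sample in gamma space vs. in
 * linear light; compile with -lm. */
#include <math.h>
#include <stdio.h>

static double srgb_to_linear(double v)
{
    return v <= 0.04045 ? v / 12.92 : pow((v + 0.055) / 1.055, 2.4);
}
static double linear_to_srgb(double v)
{
    return v <= 0.0031308 ? v * 12.92 : 1.055 * pow(v, 1.0 / 2.4) - 0.055;
}

int main(void)
{
    double a = 0.0, b = 1.0; /* black and white neighbors */
    double gamma_avg  = (a + b) / 2;
    double linear_avg = linear_to_srgb((srgb_to_linear(a) + srgb_to_linear(b)) / 2);
    printf("gamma-space avg:  %.3f\n", gamma_avg);  /* 0.500 */
    printf("linear-light avg: %.3f\n", linear_avg); /* ~0.735 */
    return 0;
}
```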
.. color spaces
- all colors are specified in (R,G,B) triples
- all colors are specified in RGB or CMYK
- fine, all colors are specified in RGB, CMYK, HSV, HSL, YCbCr or XYZ
- there is only one RGB color space
- there is only one YCbCr color space for each RGB color space
- fine, there is only one YCbCr color space for each RGB color space up to linear isomorphism
- an RGB triple unambiguously specifies a color
- an RGB triple + primaries unambiguously specifies a color
- fine, a CIE XYZ triple unambiguously specifies a color
- black is RGB (0,0,0), and white is RGB (255,255,255) (see the sketch after this list)
- all color spaces have the same white point
- color spaces are defined by the RGB primaries and white point
- my users are not going to notice the difference between BT.601 and BT.709
- there’s only one BT.601 color space
- TV range YCbCr is the same thing as TV range RGB
- full-range YCbCr doesn’t exist
- standards bodies can agree on what full-range YCbCr means
- b-bit full range means the interval [0, 2^b-1]
- a full range 8-bit color value of 255 maps to the float 1.0
- color spaces are two-dimensional
- “linear light” means “linear light”
- information outside of the interval [0,1] should always be discarded/clamped
- all gamma curves are well defined outside of the interval [0,1]
- HDR encoding is about making the image brighter
- HDR encoding means darker blacks
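To make the range falsehoods concrete: in 8-bit TV (limited) range, reference black is code 16 and reference white is code 235 (for luma; chroma uses 16–240). Code values outside that interval carry signal below black and above reference white, which is also why clamping everything to [0,1] discards information:

```c
/* Expand 8-bit TV-range luma codes to full-range floats, per the
 * BT.601/BT.709 convention (black = 16, white = 235). */
#include <stdio.h>

static double tv_to_full(int y)
{
    return (y - 16) / 219.0;
}

int main(void)
{
    printf("code 16  -> %+.3f (reference black)\n", tv_to_full(16));
    printf("code 235 -> %+.3f (reference white)\n", tv_to_full(235));
    printf("code 255 -> %+.3f (above reference white)\n", tv_to_full(255));
    printf("code 0   -> %+.3f (below black!)\n", tv_to_full(0));
    return 0;
}
```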
.. color conversion
- I don’t need to convert an image’s colors before displaying it on the screen
- all color spaces are just linearly related
- there’s only one way to convert between color spaces
- I can just clip out-of-gamut colors after conversion
- there’s only one way to pull 10-bit colors up to 16-bit precision
- linearization happens after RGB conversion
- I can freely convert between color spaces as long as I allow out-of-gamut colors
- converting between color spaces is a mathematical process so it doesn’t depend on the display
- converting from A to B is just the inverse of converting from B to A
- the OOTF is conceptually part of the OETF
- the OOTF is conceptually part of the EOTF
- all OOTFs are reversible
- all CMMs implement color conversion correctly
- all professional CMMs implement color conversion correctly
- I don’t need to dither after converting if the target colorspace is the same bit depth or higher
- converting between bit depths is just a logical shift
- converting between bit depths is just a multiplication (see the sketch after this list)
- all ICC profiles contain tables for conversion in both directions
- HDR tone-mapping is well-defined
- HDR tone-mapping is well-defined if you know the source and target display capabilities
- HDR metadata will always match the video stream
- you can easily convert between PQ and HLG
- you can easily convert between PQ and HLG if you know the mastering display’s metadata
- converting from A to linear light to B gives you the same result as converting from A to B
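On the bit-depth bullets: expanding 8-bit full-range white (255) to 16 bits with a plain left shift yields 65280, not 65535. The exact scale factor is (2^16-1)/(2^8-1) = 257, which for this particular pair of depths happens to equal the bit-replication trick (v << 8) | v, but other depth pairs have no such shortcut and need proper rounding:

```c
/* Bit-depth expansion: shift vs. exact full-scale rescaling. */
#include <stdio.h>

int main(void)
{
    int v = 255; /* 8-bit white */
    printf("8->16 shift: %d\n", v << 8);          /* 65280: wrong white */
    printf("8->16 scale: %d\n", v * 65535 / 255); /* 65535: correct */

    int w = 1023; /* 10-bit white */
    printf("10->16 shift: %d\n", w << 6);                   /* 65472: wrong */
    printf("10->16 scale: %d\n", (w * 65535 + 511) / 1023); /* 65535 */
    return 0;
}
```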
.. video output
- the graphics API will dither my output for me (see the sketch after this list)
- there’s only one way to dither output
- I need to dither to whatever my backbuffer precision is
- dithering with random noise looks good
- dithering artifacts are not visible at 6-bit precision
- dithering artifacts are not visible at 7-bit precision
- dithering artifacts are not visible at 8-bit precision
- temporal dithering is better than static dithering
- OpenGL is well-supported on all operating systems
- OpenGL is well-supported on any operating system
- waiting until the next vsync is easy in OpenGL
- video drivers correctly implement the texture formats they advertise
- I can accurately measure vsync timings
- vsync timings are consistent for a fixed refresh rate
- all displays with the same rate will vsync at the same time
- I can control the window size and position
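As a minimal example of one (of many) ways to dither: an ordered (Bayer) dither adds a position-dependent sub-LSB offset before truncation, so reducing 10-bit values to 8 bits turns banding into fine spatial noise whose local average preserves the original value. Real players use larger matrices or blue noise; the 2×2 matrix here is only for illustration:

```c
/* Ordered dither with a 2x2 Bayer matrix while truncating 10-bit
 * values to 8 bits. */
#include <stdio.h>

int main(void)
{
    static const int bayer2[2][2] = { { 0, 2 }, { 3, 1 } };
    const int v10 = 513; /* a 10-bit value between two 8-bit codes (128.25) */

    for (int y = 0; y < 2; y++) {
        for (int x = 0; x < 4; x++) {
            /* add a threshold in [0,4) before dropping the low 2 bits */
            int v8 = (v10 + bayer2[y % 2][x % 2]) >> 2;
            printf("%d ", v8);
        }
        printf("\n");
    }
    /* prints three 128s and one 129 per 2x2 cell: local average 128.25 */
    return 0;
}
```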
.. displays
- all displays are 60 Hz
- all refresh rates are integers
- all displays have a fixed refresh rate
- all displays are sRGB
- all displays are approximately sRGB
- displays have an infinite contrast
- all displays have a contrast of around 1000:1
- all displays have a white point of D65
- all displays have square pixels
- all displays use 8-bit per channel color
- all displays are PC displays
- my users will provide an ICC profile for their display
- my users will only use a single display
- my users will only use a single display for the duration of a video
- all ICC profiles for displays will have the same rendering intent
- all ICC profiles for displays will be black-scaled
- all ICC profiles for displays won’t be black-scaled
.. subtitles
- all subtitle files are UTF-8 encoded
- all subtitles are stored/rendered as RGB
- I can paint RGB subtitles on top of my RGB video files (see the sketch after this list)
- I don’t need to worry about color management for subtitles
- the subtitle color space will be the same as the video color space
- rendering subtitles at the output resolution is always better than rendering them at the video resolution
- there’s an ASS specification
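To illustrate the subtitle color problem: if the video is TV-range and the subtitle bitmap is full-range, blending the raw code values mixes two different scales. The sketch below (hypothetical helper names) handles only the range mismatch; a correct implementation would also have to match primaries and transfer functions, per the color management bullet above:

```c
/* Blend a full-range subtitle pixel over a TV-range video pixel:
 * naively on raw code values vs. on a common full-range scale. */
#include <stdio.h>

static double tv_to_full(int v)    { return (v - 16) / 219.0; }
static int    full_to_tv(double v) { return (int)(v * 219.0 + 16.5); }

int main(void)
{
    int video = 126;   /* TV-range code for ~50% gray */
    int sub   = 255;   /* full-range white subtitle pixel */
    double alpha = 0.5;

    /* naive: blend raw code values from two different ranges */
    int naive = (int)(alpha * sub + (1 - alpha) * video + 0.5);

    /* range-aware: blend on a common full-range scale, then re-encode */
    double mix = alpha * (sub / 255.0) + (1 - alpha) * tv_to_full(video);
    int aware = full_to_tv(mix);

    printf("naive: %d, range-aware: %d\n", naive, aware); /* 191 vs 181 */
    return 0;
}
```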
[1] It seems a lot of people have misunderstood this one, so let me clarify what I mean: of course, H.264 decoders (assuming no bugs) will output the same result, but the problem in practice is that you have no guarantee you’ll actually be able to access the decoder’s output unmodified. APIs like DXVA/DXVA2, D3D11VA (through ANGLE), CrystalHD, VAAPI through GLX, and VDPAU (unless you use a terrible interlaced-only hack) will post-process the results before you can access them, either by converting to RGB, changing the subsampling, or rounding 10-bit content down to 8-bit.
Some APIs are inherently safe, although they usually require copying the frame back to system RAM instead of exposing it as an on-GPU texture, so you incur extra round-trip bandwidth losses (bidirectional, instead of the one-directional cost you have to pay for swdec). The only exceptions I can think of right now are VAAPI EGL interop and CUDA.