Introduction
For the past few years, I've had something in mind for my library of gameplay footage: why not give viewers the option to see my inputs visually? Not only am I curious about my own inputs sometimes (especially in a rhythm game or FPS), but it could also be quite satisfying to see keys light up alongside the video. It's something extra. But it's cool.
It does have some practical use though. When I was teaching online due to COVID, I did something like this to help students see what I was doing in vim and tmux. Here's what I mean:
When teaching, I am often asked how I run certain commands in vim, tmux, etc. Now, my students get an extra treat over Zoom... a visual graph of my keyboard presses! It's easier to explain by showing. Satisfying too! https://t.co/gQvUZau3mW pic.twitter.com/xqc7CpWI8E
— Claıre 🌺 (@iDestyKK) September 2, 2020
That's pretty cool, huh? But it only showed the keyboard; it didn't actually record button presses. It was simply OBS capturing a keyboard input display application and compositing it into the stream.
With that, let's get started. There are 2 parts to this. First, we will write something that actually records inputs, in C++. Then we will write some fancy JavaScript to read the file in on a webpage and show it. Since the code is portable, I can literally embed it in this blog post toward the end. Stay tuned. There are also Computer Science techniques that can significantly speed this up. Specifically, the idea behind C++'s lower_bound.
InputRec: An app for recording inputs
For this project to work, we need real data. So we need an application made specifically for recording keyboard presses. Since I want full control, that means a completely custom format, so I can get the exact data I want and need. We won't be following conventions here.
A checklist of what I want
Since we are on Windows, this involves the Windows API, which I am not very familiar with. To be honest, I actually wrote the first iteration of this software on 2021/02/02. I had a few things in mind for it. Here's a checklist:
- Simple. It needs to be obvious what the hell to do. And it needs to just work.
  - Start the application. Press a button. It records. Press again, you get a file.
  - You'd think this is obvious. Elgato would like to disagree.
  - Dedicated recording button. F9 is usually my button of choice for recording. So we'll go with that.
- Binary format. Squeeze as much information as possible in at a byte level, for compactness.
  - Some of these files ended up at around 10-11 KB. In a text-based format like JSON, they would be much larger. Across thousands of videos, I'd rather save the space and bandwidth, considering this is meant to be downloaded alongside a YouTube video.
- Nanosecond precision. The file shall contain a header with the exact moment the recording button was pressed. All button presses shall also have a nanosecond-precise delta from this starting point, with 64-bit precision. In Windows, timestamps are at 100-nanosecond precision, which is more than enough for our needs. (A quick sketch of producing such a timestamp follows this list.)
  - Given that we store the nanosecond timestamp of when we record our videos, this lets us automate aligning these files by taking the difference between those timestamps.
- Clear indication of key presses. This is... obvious. But I want each timestamp to be paired with a key ID and whether the key was pressed down or released. We can pack both into a single integer to optimise space.
- Versioning. Let's say I update InputRec (which I did in 2024); I want the header to indicate that with a version number.
- Safe from anti-cheat. I was worried about this. Since this involves a SetWindowsHookEx call, I wanted to make sure various anti-cheats wouldn't pick up on the system call and flag me. I really doubted it, as other software also uses this call to capture and record inputs.
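As a quick illustration of that nanosecond requirement, here's a minimal sketch (not the actual InputRec source) of producing such a timestamp with C++'s <chrono>:

```cpp
#include <chrono>
#include <cstdint>

// Minimal sketch: a nanosecond-precision UNIX timestamp, the same shape of
// value the INP header stores. (Illustrative; not the actual InputRec code.)
uint64_t unix_ns_now() {
	using namespace std::chrono;

	return duration_cast<nanoseconds>(
		system_clock::now().time_since_epoch()
	).count();
}
```

Aligning two recordings then boils down to subtracting their start timestamps.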
Capturing keyboard inputs
First, we need a way to capture keystrokes, whether they are pressed or released. In something like GameMaker, we had functions like keyboard_check_direct to bypass being the current window; it just checks at a hardware level. In C++, we don't have that luxury. We have to write it ourselves. For that, a low-level hook is necessary.
```cpp
MSG msg;
HHOOK hook_kb = SetWindowsHookEx(WH_KEYBOARD_LL, LowLevelKeyboardProc, 0, 0);

// Keep running until told to stop
while (GetMessage(&msg, NULL, 0, 0) > 0) {
	TranslateMessage(&msg);
	DispatchMessage(&msg);
}
```
We need to keep the application open, so that's why the bottom portion is a message loop. The hook attaches at a system level and monitors keyboard events. Whenever one happens, Windows calls LowLevelKeyboardProc, which is a function we get to define. Isn't that nice?
```cpp
LRESULT CALLBACK LowLevelKeyboardProc(int nCode, WPARAM wParam, LPARAM lParam) {
	PKBDLLHOOKSTRUCT p;

	if (nCode == HC_ACTION) {
		p = (PKBDLLHOOKSTRUCT) lParam;

		switch (wParam) {
			case WM_KEYDOWN:
			case WM_SYSKEYDOWN: // Alt combinations arrive as "sys" key events
				// A key was pressed. Check p->vkCode
				break;

			case WM_KEYUP:
			case WM_SYSKEYUP:
				// A key was released. Check p->vkCode
				break;
		}
	}

	return CallNextHookEx(NULL, nCode, wParam, lParam);
}
```
That's pretty nifty. Now we just have to handle the record button being F9. I do this by having a #define REC_BTTN VK_F9 at the top. Then, when a key is pressed, simply check if it's that button and start/stop recording accordingly, as in the sketch below. But we need something to handle the recording and storing of the inputs.
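As a hedged sketch of that glue (the recording flag here is hypothetical, and rec_kb is an instance of the recorder class introduced in the next section):

```cpp
// Hypothetical glue code; the real InputRec wires this up through its
// recorder class, shown below.
static bool recording = false;

// Inside LowLevelKeyboardProc's WM_KEYDOWN case:
if (p->vkCode == REC_BTTN) {
	// Toggle recording on each F9 press
	recording = !recording;

	if (recording) rec_kb.record_start();
	else           rec_kb.record_stop(); // then rec_kb.record_save(...)
}
else if (recording) {
	// Bit 8 marks the key as pressed down (explained below)
	rec_kb.record_capture(0x100 | p->vkCode);
}
```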
Now we can get back to pure C++, and not any Windows-specific stuff.
Class for recording inputs
Now we need a class capable of storing the recorded key presses and then exporting them to a file. The format is actually very simple. We will utilise chrono::high_resolution_clock along with Windows hooks and a deque data structure to hold all key-press events and timestamps. A small look into recorder.hpp reveals the following class prototype:
```cpp
template <typename T>
class recorder {
	public:
		recorder();

		void record_start();
		void record_stop();
		void record_capture(const T &);
		void record_save(const string & = "recording.inp");

	private:
		struct ev_pair {
			ev_pair();
			ev_pair(const T &);
			ev_pair(const chrono::high_resolution_clock::time_point &,
			        const T &);

			chrono::high_resolution_clock::time_point ts;
			T val;
		};

		chrono::system_clock::time_point          sys_start;
		chrono::high_resolution_clock::time_point start;
		chrono::high_resolution_clock::time_point end;

		deque<ev_pair> events;
};
```
It's very simple. Every function's name makes it obvious what needs to be done. Event pairs (ev_pair) store the timestamp along with the key pressed. This class also allows object-oriented recording and holding onto the data to export at a later time, if I ever wanted to use this in another application for some reason... That's also why it's templated here. For our case, recorder<uint16_t> is perfect.
The record_capture function is basically a glorified push_back. It's what adds an event to the recording: it takes the key pressed, figures out the timestamp, and stores both. This function is what gets called the moment LowLevelKeyboardProc fires and captures a key press or release.
When a key is pressed down, the value stored is the vkCode OR'd with 0x100. In other words:
```cpp
rec_kb.record_capture(0x100 | p->vkCode);
```
Otherwise, it will simply store the key ID in vkCode. This is how we will be able to tell a button press from a release. And it's all in a single integer.
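A minimal sketch of what record_capture might boil down to (the real body lives in recorder.hpp; this is just the idea):

```cpp
// Sketch: timestamp the event the moment it arrives, then append it.
template <typename T>
void recorder<T>::record_capture(const T &val) {
	events.push_back(ev_pair(chrono::high_resolution_clock::now(), val));
}
```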
File format, visualised
The file format may be completely custom, but that also means there's a specification for it. Thus, behold the following first 160 bytes of an INP file:
Every INP file will have a magic number: 1128616521, or 0x43455249 in hex, or IREC as a string. This sits at the very beginning of the file. After that comes a 32-bit integer for the version. After that, a 64-bit integer holds the timestamp of when the recording started. This factors in nanoseconds, so you will need to divide it by 1,000,000,000 (1 billion) to get a proper UNIX timestamp. After the start timestamp, there is a duration timestamp in the same format. After that, there's a 64-bit unsigned integer for the number of events (key presses and releases). From there, every 10 bytes is an 8-2 byte pair: a timestamp and a button input. The former requires division by a billion. The latter is an unsigned 16-bit integer.
To put it like a table:

| Position | Name | Data Type | Notes |
|---|---|---|---|
| 0x00 | Magic | char[4] | Always IREC |
| 0x04 | Version | uint32_t | |
| 0x08 | Start Timestamp | uint64_t | Divide by a billion |
| 0x10 | Duration | uint64_t | Divide by a billion |
| 0x18 | Number of Events | uint64_t | |
| 0x20 + (N * 0xA) | Event N timestamp | uint64_t | Divide by a billion |
| 0x28 + (N * 0xA) | Event N key ID | uint16_t | Key ID is the first 8 bits. Press state is the 9th bit. |
The division by a billion is done after the number is cast to a double. If you want the nanoseconds as a whole number instead, you can just modulo the number by a billion. To show what the timestamp contains, take the example above with bytes 24 00 9D AB 83 BF C9 17. Read little-endian, this decodes to 1714111705431998500. Divide that by 1 billion and throw it into GNU date, and you get the following ISO 8601 compliant timestamp:
```
UNIX> date -d "@1714111705.431998500" '+%Y-%m-%dT%H:%M:%S.%N%:z'
2024-04-26T02:08:25.431998500-04:00
```
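In code, splitting that raw value up is just integer division and modulo. Note that the cast-to-double route rounds off the lowest digits, since a double only holds about 53 bits of precision:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
	uint64_t ts = 1714111705431998500ULL; // from the example bytes above

	uint64_t secs = ts / 1000000000ULL;   // 1714111705 (UNIX timestamp)
	uint64_t ns   = ts % 1000000000ULL;   // 431998500  (nanosecond remainder)
	double   f    = (double) ts / 1e9;    // ~1714111705.4319985

	printf("%llu.%09llu (%.7f)\n",
	       (unsigned long long) secs, (unsigned long long) ns, f);

	return 0;
}
```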
Every single video in the DERPG archive from 2014 onward has a timestamp like this. You can see an example of this in "Grind Series: Quantity without compromising Quality". The point is that we can take the timestamp of this file, along with the one stored in the MKV master, and compute the delay between the two to guarantee that the buttons sync up automatically. Just because the user presses F9 and both my application and, say, GeForce Experience catch it doesn't mean they both start recording at the same time. So this is necessary.
Finally, just to hit it with a digital sledgehammer, here's the format visualised with a figure:
The C-like struct syntax to the right has the definition order flipped to make it easier to read. Anyway, get the idea? Good. Because we will need this for the JavaScript implementation. Binary is fun.
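If you'd rather read a struct than squint at a figure, here's a hedged reconstruction of that layout in C-like terms (field names are mine, pulled from the table above):

```cpp
#include <cstdint>

// Reconstructed from the table above; names are illustrative.
// All integers are stored little-endian.
#pragma pack(push, 1)
struct inp_event {
	uint64_t ts;   // nanoseconds from "start" (divide by a billion)
	uint16_t key;  // bits 0-7: key ID (vkCode), bit 8: 1 = down, 0 = up
};

struct inp_header {
	char     magic[4];   // always "IREC"
	uint32_t version;
	uint64_t start;      // UNIX timestamp, in nanoseconds
	uint64_t duration;   // in nanoseconds
	uint64_t num_events; // count of inp_event records that follow
};
#pragma pack(pop)
```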
This is open source, btw
With that, we have a program that is able to record keystrokes with nanosecond precision just by pressing F9. If you want the code or executable, this is all open source on my GitHub. Simply go to https://github.com/iDestyKK/InputRec.
Playback
Now that we have a format that gives us the exact data we need, it's time to throw it on the screen and actually show it alongside a video. For that, we need some JavaScript to read in the binary file and parse it. So, I introduce inp.js.
Reading in Binary Data
To get the file and read it in JavaScript, we have to resort to some weird stuff, I guess. But basically we need to make anXMLHttpRequest
and override the MimeType to be plain text, so
it forces the web browser to not parse it. Instead, it passes the bytes
through unmodified. This means we can grab an array of the file's bytes
via:
```js
function inp_read(url) {
	// Create a request
	var request = new XMLHttpRequest();
	request.overrideMimeType('text/plain; charset=x-user-defined');
	request.open('GET', url, false);
	request.send();

	// Get the binary from the file, please
	let buffer = new Uint8Array(request.response.length);

	// Save the day (assignment into a Uint8Array keeps only the low 8 bits)
	for (let i = 0, len = request.response.length; i < len; i++)
		buffer[i] = request.response.charCodeAt(i);

	// Have a nice day
	return buffer;
}
```
Once that's done, inp.js has some helper functions at the top which give us the data from the header mentioned up above. That's why I wanted to write it out and document it. The following functions exist in inp.js:

- inp_read_uint16
- inp_read_uint32
- inp_read_uint64
- inp_read_int64
- inp_read_double

They are all fairly straightforward and obvious. Conveniently, inp_read_double will perform the division by a billion for me.
So we are able to read in the header and all of the data pretty easily. The binary of the INP file is tossed into a function called inp_import to do this for us. It returns an associative array that has all events, the duration, etc. In fact, the loading procedure with this library is:
```js
let inp_root = inp_read("file.inp");
let inp      = inp_import(inp_root);
let list     = inp_inputs_generate_list(inp);
```
But, loading in the events isn't enough to show all of the button presses. We have the timestamps and whether a button is pressed. But we still need to have a convenient way to turn a timestamp from a YouTube video into a "this is what buttons are pressed down at this moment".
Event frames
The inp_inputs_generate_list function generates what I'd like to call "Button Pressing Frames". Basically, it is an array of all keys pressed at a given moment. Assume there are 256 possible keys (which is... massively overkill). The moment a key is pressed or released, a new "frame" is generated and thrown into the array, updating which keys are pressed. Each frame is timestamped too.
For instance, if I press "A", a frame is created that shows "A" being held down. Then if I press "W" while still holding down "A", it will generate another frame which shows both buttons being held down. It only does this upon a key being pressed or released. So it's light on memory usage.
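In C++ terms (the shipped implementation is the JavaScript above, so treat this as a sketch of the idea), a frame is a timestamp plus a full snapshot of key states, and each event clones the previous snapshot:

```cpp
#include <array>
#include <cstdint>
#include <deque>

// Sketch of the "frame" idea; the real version lives in inp.js.
struct frame {
	double ts;                  // seconds into the video
	std::array<bool, 256> held; // held[key] == true if that key is down
};

std::deque<frame> frames;

// Called once per recorded event, in timestamp order
void push_frame(double ts, uint16_t ev) {
	// Clone the previous snapshot (or start from all-released)
	frame f = frames.empty() ? frame{} : frames.back();

	f.ts              = ts;
	f.held[ev & 0xFF] = (ev & 0x100) != 0; // bit 8 = pressed

	frames.push_back(f);
}
```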
To extend on this even further, here's 8 "frames". Notice the timestamps and buttons pressed.
8 frames for the key states of nearly 4.5 seconds is pretty nice. When a frame needs to be drawn, it's done by modifying the HTML of the page. Each key ID is paired with a DIV on the webpage, and we simply change the CSS class of that DIV to indicate whether it's held down or not. Doing this manually by throwing a frame at the page is really simple.
The other advantage to doing it this way is that it accounts for the viewer jumping to a random position in the video. It doesn't rely on a previous state. It just stores what should be pressed at a given point in time. So if someone went 4 minutes ahead or back in a video, we can just set the DIVs appropriately. But... how can we grab each "frame" efficiently?
Syncing frames to a YouTube video, naïvely
Before we actually get to this, when you embed a YouTube video on a page (like in the demo below), you may get the timestamp of it via either:
```js
// This page
player.getCurrentTime();

// If on YouTube directly...
document.getElementById("movie_player").getCurrentTime();
```
This timestamp is in seconds, with microsecond accuracy. More than you'll ever need for a video. That's perfect. Our button "frames" are also in seconds with that same accuracy. So if we just seek to a point in that array where the timestamp matches, we will know exactly what buttons are pressed at any moment.
Technically, the correct answer is easy. Anyone can do a naïve and inefficient loop like this:
```js
let list = inp_inputs_generate_list(inp);
let pos  = player.getCurrentTime();
let i, sz = list.length;

// Go through all frames until the one just before "pos", if possible
for (i = 1; i < sz; i++) {
	if (list[i][0] > pos) {
		/* Update all DIVs to match list[i - 1] */
		/* End loop early */
		break;
	}
}
```
While it would work, it's also very slow, and it becomes progressively slower as the video goes on and on. The demo I will feature down below has 5841 "frames". Do we really want to go through a loop that many times, at 60 frames a second? I don't think so... In Computer Science, we call this a linear time operation, or O(n). We can do better. Much better.
Syncing frames to a YouTube video, efficiently
First off, notice that the frames are conveniently sorted for us already. We can take advantage of this by combining two concepts: Binary Search and Upper/Lower Bound. Binary search basically means to start in the middle of a sorted set and go either way depending on whether the target is in the lower or upper half. So if you had 1000 elements and were searching for 625, start at 500. 625 is greater than 500, so search 500-1000; the midpoint is 750. 625 is less than 750, so search 500-750. You get 625. Instead of going through a loop 625 times, you do 3 iterations. This is known as logarithmic time, or O(log n), where n is the number of elements.
The thing is, we can't just rely on Binary Search alone, because we are searching for a number that almost certainly isn't in the frame array. This is where a concept similar to upper/lower bound comes into play. If the video is at 4.3 seconds and button frames exist at 4.2 and 4.49 seconds, it needs to select the 4.2 second one. It must select the one that is just under it. In C++ terms, it's the upper bound - 1 in all but one case.
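Since the intro promised C++'s lower_bound would show up, here's a small sketch of that "upper bound - 1" idea using std::upper_bound over an array of frame timestamps (the shipped version is the hand-rolled JavaScript further below):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Given sorted frame timestamps, return the index of the frame active at
// "pos". std::upper_bound finds the first timestamp strictly greater than
// pos, so the active frame is the one just before it, except when pos lands
// before the very first frame (the one exceptional case).
size_t frame_at(const std::vector<double> &ts, double pos) {
	auto it = std::upper_bound(ts.begin(), ts.end(), pos);

	if (it == ts.begin())
		return 0;                 // before the first frame

	return (it - ts.begin()) - 1; // the frame just under "pos"
}
```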
Let's treat the list like a BST (Binary Search Tree). Do note, it's still a list or array internally. But we can simulate the root node by making the midway point the root. Then it breaks into two halves, and so on. It's better to show it visually. Thus,
In this array of 15 entries, the 8th entry (frame[7]) is the root node. Notice that all nodes point down to a "key frame" with a dotted line. At the bottom, frame IDs are notated along with timestamps. Now, all we have to do is go down the tree from the top (either left or right). Let's say the video is at 4.2 seconds. If so, it's on frame[6], as that's right at 4.2 seconds. That is very basic BST traversal. No keys are held down there in the figure above. Here it is explained visually:
That's 3 iterations. But also, that's the maximum number of iterations possible with this setup. It's possible it finds the frame in fewer. It's guaranteed to be this many at most, as ⌊log₂(15)⌋ = 3. If we have 16 frames, then 4 is possible. But this function grows very slowly. The 5841 frames I was talking about earlier? ⌊log₂(5841)⌋ = 12. I'll take doing 12 iterations 60 times a second over doing 5000+ 60 times a second.
But we aren't done just yet. How do we account for the timestamp falling between two frames? The figure above only covers matching a frame exactly. Let's say we are looking for 4.3 seconds. We need a range, rather than an exact number. The trick is to keep track of the minimum and maximum values as you traverse the tree. When we start at frame[7], both the minimum and maximum are 4.49. But 4.3 is less than that, so keep 4.49 as the maximum and traverse left, to 2.51. That becomes the new minimum. Go right, and keep updating the minimum. It eventually becomes 4.2, while the maximum stays 4.49. The general algorithm is that when you go left, the maximum updates; when you go right, the minimum updates.
Once the "distance" (in the array) between the minimum and maximum nodes is 1, terminate the binary search and choose the minimum. The answer is guaranteed to always be true except in extreme cases like an exact match or out of bounds. Visually, we show the traversal as follows:
Contrary to Computer Science theory and fancy diagrams, the implementation is very simple. Assume a function called inp_get_state_from_ts, which takes a timestamp and returns the correct "frame" via the logarithmic method shown up above. Thus:
```js
function inp_get_state_from_ts(state_array, target) {
	let min, max, i, len, ts;

	len = state_array.length;
	min = 0;
	max = len;
	i   = Math.floor(min + ((max - min) / 2));

	// Outlier cases
	if (target > state_array[len - 1][0])
		return state_array[len - 1];

	if (target < state_array[0][0])
		return state_array[0];

	// Binary Search
	while (true) {
		ts = state_array[i][0];

		// Extreme cases
		if (ts == target)
			return state_array[i];

		if (max - min == 1)
			return state_array[min];

		// Bound reconfiguring
		if (target > ts) {
			min = i;
			i = Math.floor(min + ((max - min) / 2));
		}
		else if (target < ts) {
			max = i;
			i = Math.floor(min + ((max - min) / 2));
		}
	}
}
```
Basically your introductory Data Structures and Algorithms course stuff, slightly modified. To avoid having to backtrack, I keep track of a minimum and maximum point, which update until the distance between the two is 1. Then I choose the minimum of the two. If we are lucky and somehow get a perfect match, we pick that instead. Lastly, if the timestamp is out of bounds (less than the first, or more than the last), we return the first or last element of the array/tree respectively.
The distance between the minimum and maximum being 1 is guaranteed to
happen, with the only exceptions being extreme bounds and an exact match.
It's how we can account for a timestamp falling between two nodes as shown
in the visual above.
This playing around with the bounds is why I was careful to say it's similar to upper/lower bound. I suppose you could implement those with this mentality, and it would be very close. The only difference is that you would keep the comparison target < ts to mean "go left" and go right otherwise (upper bound), or change it to target <= ts (lower bound).
For the record, that explanation of BSTs is based on my time as a graduate teaching assistant. I actually had to teach this stuff. You can check out my notes here if you want to learn more about BSTs and other things.
See it in action!
Of course I need to actually show this off! It also gives me a place to dump example visuals that I can also go back to in the future. So, here's a YouTube video embed along with a keyboard on the side. Go on and try watching it on the page. The keyboard will always be in sync with the video regardless of what you do. Even on a frame-by-frame basis.
(The interactive demo renders here: the embedded video next to a full on-screen keyboard, from Esc down to Right Ctrl, whose keys light up in sync.)
Well... it certainly made me realise I re-press keys a lot more than I should. Also, this demo was updated to a longer and more extreme test case to show the true performance of this method. So excuse the fact that the video is casually 20 days in the future from when this blog post was written.
Conclusion
As usual, I tried to work in a Computer Science concept to help explain how to make things a bit more efficient. While the example up above was just one video, there are INP files for basically every video recorded after that point. There are also some from 2021, but they have bad data in them from a programming error. Maybe it's possible to recover them. But that'll be for another time.
Sometime in the future, I plan to write a page that will let you see various videos from the DERPG catalogue with this extra detail. There are other extra files that can be utilised. In MW (2019), BOCW, and Vanguard, the CoD API gives you access to match details. And so we can have a minimap that shows the exact location of kills as they happen. It's niche, but it's fun to experiment with this data.
Hope you learned something too.