Showing keyboard inputs with a video

Introduction

I had something in mind for my library of gameplay footage, at least in the past few years. Why not give users the option to visually see inputs? Not only am I curious about my own inputs sometimes (especially in a rhythm game or FPS), but it could also be quite satisfying to see keys light up alongside the video. It's something extra. But it's cool.

It does have some practical use though. When I was teaching online due to COVID, I did something like this to help students see what I was doing in VIM and TMUX. Here's what I mean:

When teaching, I am often asked how I run certain commands in vim, tmux, etc. Now, my students get an extra treat over Zoom... a visual graph of my keyboard presses! It's easier to explain by showing. Satisfying too! https://t.co/gQvUZau3mW pic.twitter.com/xqc7CpWI8E
— Claıre 🌺 (@iDestyKK) September 2, 2020

That's pretty cool, huh? But it only showed the keyboard. It didn't actually record button presses. It simply was OBS capturing a keyboard input shower application and pasting it into the stream.

With that, let's get started. There are 2 parts to this. We will need to write something that actually records inputs, in C++. Then we will have to write some fancy JavaScript to read the file in on a webpage and show it. Since the code will be portable, I can literally embed it in this blog post to show toward the end. Stay tuned. There are Computer Science techniques that can be used to significantly speed this up. Specifically, C++'s lower_bound.

InputRec: An app for recording inputs

For this project to work, we need real data. So we need an application specifically made for recording keyboard presses. Since I want full control, it will mean a completely custom format so I can get the exact data that I want and need. We won't be following conventions here.

A checklist of what I want

Since we are dealing with Windows, this involves dealing with the Windows API, which I am not very familiar with. To be honest, I actually wrote the first iteration of this software on 2021/02/02. I had a few things in mind for it. Here's a checklist:

Simple. It needs to be obvious what the hell to do. And it needs to just work.
- Start application. Press button. It records. Press again, you get file.
- You'd think this is obvious. Elgato would like to disagree.
Dedicated recording button. Usually F9 is my button of choice for recording. So we'll go with that.
Binary format. Squeeze as much information as possible at a byte-level, for compactness.
- Some of these files ended up at around 10-11 KB. If it were in a text-based format, like JSON, it would be much larger. Thousands of videos considered, I'd rather save the space and bandwidth considering this is meant to be downloaded alongside a YouTube video.
Nanosecond precision. The file shall contain a header with the exact moment the recording button was pressed. All button presses shall also have a nanosecond-precise delta from this starting point, with 64-bit precision. In Windows, timestamps are at 100-nanosecond precision, which is way more than what we need.
- Given that we store the nanosecond timestamp of when we record our videos, this lets us automate alignment of these files together by taking the difference between those timestamps.
Clear indication of key presses. This is... obvious. But I want each timestamp to be paired with a key press and whether it was pressed down or not. We can compact the latter into a single integer to optimise space.
Versioning. Let's say I update InputRec (which I did in 2024), I want the header to indicate that with a version number.
Safe to Anti-cheat. I was worried about this. But if you do a SetWindowsHookEx call, I wanted to make sure various anti-cheats won't pick up on the system call and flag me. I really doubted it, as other software also uses this to pick up and record.

Capturing keyboard inputs

First, we need a way to get keystrokes, whether they are pressed or released. In something like Game Maker, we had functions like keyboard_check_direct to bypass being the current window. It just checks at a hardware level. In C++, we don't have that luxury just yet. We have to write it. For that, a low-level hook is necessary.

C++ Code

MSG msg;
HHOOK hook_kb = SetWindowsHookEx(WH_KEYBOARD_LL, LowLevelKeyboardProc, 0, 0);

// Keep running until told to stop
while (!GetMessage(&msg, NULL, NULL, NULL)) {
	TranslateMessage(&msg);
	DispatchMessage(&msg);
}

We need to keep the application open. So that's why the bottom portion is in a loop. It will continuously get messages, which are our key presses. At a system level, this hooks on and monitors for those. Then it calls LowLevelKeyboardProc, which is a function. Isn't that nice?

C++ Code

LRESULT CALLBACK LowLevelKeyboardProc(int nCode, WPARAM wParam, LPARAM lParam) {
	PKBDLLHOOKSTRUCT p;

	if (nCode == HC_ACTION) {
		p = (PKBDLLHOOKSTRUCT) lParam;

		switch (wParam) {
			case WM_KEYDOWN:
				// A key was pressed. Check p->vkCode
				break;

			case WM_KEYUP:
				// A key was released. Check p->vkCode
				break;
		}
	}

	return CallNextHookEx(NULL, nCode, wParam, lParam);
}

That's pretty nifty. Now we just have to consider the record button being F9. I do this by having a #define REC_BTTN VK_F9 at the top. Then when the key is pressed, simply check if it's that button and start/stop recording. But, we need something to handle the recording and storing the inputs.

Now we can get back to pure C++, and not any Windows-specific stuff.

Class for recording inputs

Now we need a class that is capable of storing the recorded key presses and then export it into a file. The format is actually very simple. We will utilise chrono::high_resolution_clock along with Windows hooks and a deque data structure to hold all key-press events and timestamps. A small look into recorder.hpp reveals the following class prototype:

C++ Code

template <typename T>
class recorder {
    public:
		recorder();

		void record_start();
		void record_stop();

		void record_capture(const T &);
		void record_save(const string & = "recording.inp");

	private:
		struct ev_pair {
			ev_pair();
			ev_pair(const T &);
			ev_pair(const chrono::high_resolution_clock::time_point &,
				const T &);

			chrono::high_resolution_clock::time_point ts;
			T val;
		};

		chrono::system_clock::time_point sys_start;
		chrono::high_resolution_clock::time_point start;
		chrono::high_resolution_clock::time_point end;

		deque<ev_pair> events;
};

It's very simple. Every function's name makes it obvious what needs to be done. Event pairs (ev_pair) store the timestamp, along with the key pressed. And this class also allows object-oriented recording and holding onto the data to export at a later time, if I wanted to use this in another application for some reason... That's also why it's templated here. For our case, recorder<uint16_t> is perfect.

The record_capture is basically a glorified push_back function. It's what adds an event to the recording. It takes the key pressed and will figure out the timestamp and store it. This function is what will be called the moment LowLevelKeyboardProc fires and captures a key press or release.

When a key is pressed down, the value stored is the vkCode OR'd with 0x100. In other words:

C++ Code

rec_kb.record_capture(0x100 | p->vkCode);

Otherwise, it will simply store the key ID in vkCode. This is how we will be able to tell a button press from a release. And it's in a single integer.

File format, visualised

The file format may be completely custom. But that also means that there's a specification for it. Thus, behold the following first 160 bytes of an INP file:

Every INP file will have a magic number, being 1128616521 or 0x43455249 in hex, or IREC as a string. This is at the very beginning of the file. After that, a 32-bit integer for the version exists. After that, a 64-bit integer is for the timestamp of when the recording started. This is to factor in nanoseconds. So you will need to divide it by 1,000,000,000 (1 billion) to get a proper UNIX timestamp. After the start timestamp, there is a duration timestamp in the same format. After that, there's a 64-bit unsigned integer for the number of events (key presses and releases). After that, every 10 bytes is a 8-2 pair of timestamps and button inputs. The former requires division by a billion. The latter is a unsigned 16-bit integer.

To put it like a table:

Position	Name	Data Type	Notes
0x00	Magic	char[4]	Always `IREC`
0x04	Version	uint32_t
0x08	Start Timestamp	uint64_t	Divide by billion
0x10	Duration	uint64_t	Divide by billion
0x18	Number of Events	uint64_t
0x20 + (N * 0xA)	Event N timestamp	uint64_t	Divide by billion
0x28 + (N * 0xA)	Event N key ID	uint16_t	Key ID is the first 8 bytes. Press is 9th bit.

The division by a billion is done after the number is casted to a double. If you want the nanoseconds as a whole number instead, you can just modulo the number by a billion instead. To show what the timestamp contains, assume the above example with bytes 24 00 9D AB 83 BF C9 17. This decodes to 1714111705431998500. That divided by 1 billion and then thrown into GNU date reveals the following ISO 8601 compliant timestamp:

Bash Command

UNIX> date -d "@1714111705.431998500" '+%Y-%m-%dT%H:%M:%S.%N%:z'

2024-04-26T02:08:25.431998500-04:00

Every single video in the DERPG archive from 2014 onward has a timestamp like this. You can see an example of this in "Grind Series: Quantity without compromising Quality". The point is that we can take the timestamp of this file, along with the one stored in the MKV master and compute the delay between the two to guarantee that the buttons sync up automatically. Just because the user presses F9 and both my application and, say, GeForce Experience catch it doesn't mean they both start recording at the same time. So, this is necessary.

Finally, just to hit it with a digital sledgehammer, here's the format visualised with a figure:

The C-like struct syntax to the right has the definition order flipped to make it easier to read. Anyway, get the idea? Good. Because we will need this for the JavaScript implementation. Binary is fun.

This is open source, btw

With that, we have a program that is able to record keystrokes with nanosecond precision just by pressing F9. If you want the code or executable, this is all open source on my GitHub. Simply go to https://github.com/iDestyKK/InputRec.

Playback

Now that we have a format that gives us the exact data we need, it's time to throw it on the screen and actually show it alongside a video. For that, we need some JavaScript to read in the binary file and parse it. So, I introduce inp.js.

Reading in Binary Data

To get the file and read it in JavaScript, we have to resort to some weird stuff, I guess. But basically we need to make an XMLHttpRequest and override the MimeType to be plain text, so it forces the web browser to not parse it. Instead, it passes the bytes through unmodified. This means we can grab an array of the file's bytes via:

JavaScript

function inp_read(url) {
	// Create a request
	var request = new XMLHttpRequest();
	request.overrideMimeType('text/plain; charset=x-user-defined');
	request.open(
		'GET',
		url,
		false
	);
	request.send();

	// Get the binary from the file, please
	let buffer = new Uint8Array(request.response.length);

	// Save the day
	for (let i = 0, len = request.response.length; i < len; i++)
		buffer[i] = request.response.charCodeAt(i);

	// Have a nice day
	return buffer;
}

Once that's done, the INP file has some helper functions at the top which will give us the data from the header mentioned up above. That's why I wanted to write it out and document it. The following functions exist in inp.js:

inp_read_uint16
inp_read_uint32
inp_read_uint64
inp_read_int64
inp_read_double

They are all fairly straight-forward and obvious. Conveniently, the inp_read_double will perform the division by a billion for me. So we are able to read in the header and all of the data pretty easily. The binary of the INP file is tossed into a function called inp_import to do this for us. It will return an associative array that has all events, duration, etc. In fact, the loading procedure with this library is:

JavaScript

let inp_root = inp_read("file.inp");
let inp      = inp_import(inp_root);
let list     = inp_inputs_generate_list(inp);

But, loading in the events isn't enough to show all of the button presses. We have the timestamps and whether a button is pressed. But we still need to have a convenient way to turn a timestamp from a YouTube video into a "this is what buttons are pressed down at this moment".

Event frames

The inp_inputs_generate_list function generates what I'd like to call, "Button Pressing Frames". Basically, it is an array of all keys pressed at a given moment. Assume there are 256 possible keys (which is... massively overkill). The moment a key is pressed or released, a new "frame" is generated and thrown into the array updating what key is pressed or not. Each frame is timestamped too.

For instance, if I press "A", a frame is created that shows "A" being held down. Then if I press "W" while still holding down "A", it will generate another frame which shows both buttons being held down. It only does this upon a key being pressed or released. So it's optimal on memory usage.

To extend on this even further, here's 8 "frames". Notice the timestamps and buttons pressed.

8 frames for the key states of nearly 4.5 seconds is pretty nice. Once a key frame is expected to be drawn, it does so by modifying the HTML of the page. Each key ID is paired to a DIV on a webpage. And it simply changes the CSS class of that DIV to indicate it's held down or not. Doing this manually by throwing a frame at the page is really simple.

The other advantage to doing it this way is that it accounts for if the person jumps to a different position in the video randomly. It doesn't rely on a previous state. It just stores what should be pressed at a given point of time. So if someone went 4 minutes ahead or back in a video, it can just set the DIVs appropriately. But... how can we grab each "frame" effeciently?

Syncing frames to a YouTube video, naïvely

Before we actually get to this, when you embed a YouTube video on a page (like in the demo below), you may get the timestamp of it via either:

JavaScript

// This page
player.getCurrentTime();

// If on YouTube directly...
document.getElementById("movie_player").getCurrentTime();

This timestamp is in seconds, with microsecond accuracy. More than you'll ever need for a video. That's perfect. Our button "frames" are also in seconds with that same accuracy. So if we just seek to a point in that array where the timestamp matches, we will know exactly what buttons are pressed at any moment.

Technically, the correct answer is easy. Anyone can do a naïve and inefficient loop like this:

JavaScript

let list = inp_inputs_generate_list(inp);
let pos = player.getCurrentTime();
let i, sz = list.length;

// Go through all frames until the one just before "pos", if possible
for (i = 1; i < sz; i++) {
	if (list[i][0] > pos) {
		/* Update all DIVs to list[i - 1][0] */

		/* End loop early */
		break;
	}
}

While it would work, it's also very slow. It would become progressively slower as the video goes on and on. The demo I will feature down below has 5841 "frames". Do we really want to go through a loop that many times, at 60 frames a second? I don't think so... In Computer Science, we call this a linear time operation, or O(N) . We can do better. Much better.

Syncing frames to a YouTube video, efficiently

First off, notice that the frames are conveniently sorted for us already. We can take advantage of this by combining two concepts: Binary Search and Upper/Lower Bound. Binary search basically means to start in the middle of a sorted set and go either way depending on whether the data is at the lower or upper end. So if you had 1000 elements and were searching for 625, start at 500. 625 is greater than 500. So search 500 - 1000. 750 is the midpoint. 625 is less than 750. So search 500 - 750. You get 625. Instead of going 625 times in a loop, you do 3 iterations. This is known as logarithmic time, or O(lg N) , where lg N = log base 2 of N .

The thing is, we can't just rely on Binary Search alone, because we are searching for a number that almost certainly isn't in the frame array. This is where a concept similar to upper/lower bound comes into play. If the video is at 4.3 seconds and button frames exist at 4.2 and 4.49 seconds, it needs to select the 4.2 second one. It must select the one that is just under it. In C++ terms, it's the upper bound - 1 in all but one case.

Let's assume the list as like a BST (Binary Search Tree). Do note, it's still a list or array internally. But we can simulate the root node by making the midway point be the root. Then it breaks into 2 and so on. It's better to show it visually. Thus,

In this array of 15 entries, the 8th entry (frame[7]) is the root node. Notice that all nodes point down to a "key frame" with a dotted line. At the bottom, frame IDs are notated along with timestamps. Now, all we have to do is go down the tree from the top (either left or right).

Let's say the video is at 4.2 seconds. If so, it's on frame[6], as that's right at 4.2 seconds. That is very basic BST traversal. No keys are held down there in the figure above. Here's it explained visually:

That's 3 iterations. But also, that's the maximum amount of iterations possible with this setup. It's possible it finds the frame in less. It's guaranteed to be this many at most, as lg(15) is almost 3.907 . If we have 16 frames, then 4 is possible. But this equation grows very slowly. The 5841 frames I was talking about earlier? lg(5841) is almost 12.512 . I'll take doing 12 iterations 60 times a second over doing 5000+ 60 times a second.

But, we aren't done just yet. How do we account for if the timestamp is between two frames? The figure above only accounts for if we match a frame exactly. Let's say we are looking for 4.3 seconds. We need a range, rather than an exact number. The trick is to keep track of the minimum and maximum values as you traverse the tree. When we start at 7, both the minimum and maximum is 4.49. But 4.3 is less than that. Keep 4.49 as the maximum. Traverse left to 2.51. That becomes the new minimum. Go right. And keep updating the minimum. It eventually becomes 4.2, while the maximum is 4.49. The general algorithm is that when you go left, the maximum updates. When you go right, the minimum updates.

Once the "distance" (in the array) between the minimum and maximum nodes is 1, terminate the binary search and choose the minimum. The answer is guaranteed to always be true except in extreme cases like an exact match or out of bounds. Visually, we show the traversal as follows:

Contrary to Computer Science theory and fancy diagrams, the implementation is very simple. Assume a function called inp_get_state_from_ts, which will take a timestamp and return the correct "frame" via the logarithmic method shown up above. Thus:

JavaScript

function inp_get_state_from_ts(state_array, target) {
	let min, max, i, len, ts;

	len = state_array.length;
	min = 0;
	max = len;
	i = Math.floor(min + ((max - min) / 2));

	// Outlier cases
	if (ts > state_array[len - 1][0])
		return state_array[len - 1];

	if (ts < state_array[0][0])
		return state_array[0];

	// Binary Search
	while (true) {
		ts = state_array[i][0];

		// Extreme cases
		if (ts == target)
			return state_array[i];

		if (max - min == 1)
			return state_array[min];

		// Bound reconfiguring
		if (target > ts) {
			min = i;
			i = Math.floor(min + ((max - min) / 2));
		}
		else
		if (target < ts) {
			max = i;
			i = Math.floor(min + ((max - min) / 2));
		}
	}
}

Basically your introductory Data Structures and Algorithms course stuff slightly modified. To avoid having to go back, I keep track of a minimum and maximum point, which updates until the distance between the two is 1. Then I choose the minimum of the two. If we are lucky and get a perfect match somehow, then we pick that instead. Lastly, if the timestamp is out of bounds (less than first, or more than last), it assigns to the first or last elements in the array/tree respectively.

The distance between the minimum and maximum being 1 is guaranteed to happen, with the only exceptions being extreme bounds and an exact match. It's how we can account for a timestamp falling between two nodes as shown in the visual above. This playing around with the bounds is why I was careful to say it's similar to upper/lower bound. I suppose you can implement those with this mentality. And it would be very close. The only difference is that you would keep the comparison target < ts to mean go left, then go right otherwise (upper bound). Or change it to a target <= ts (lower bound).

For the record, that explanation on BST was based on when I was a graduate teaching assistant. I actually had to teach this stuff. You can check out my notes on that stuff here if you want to learn more about BSTs and other things.

See it in action!

Of course I need to actually show this off! It also gives me a place to dump example visuals that I can also go back to in the future. So, here's a YouTube video embed along with a keyboard on the side. Go on and try watching it on the page. The keyboard will always be in sync with the video regardless of what you do. Even on a frame-by-frame basis.

Esc

F10

F11

F12

TAB

[

]

CAPS

;

ENTER

SHIFT

CTRL

WIN

ALT

SPACE

RALT

RWIN

RCTRL

Well... it certainly made me realise I repress keys a lot more than I should. Also, this was updated to a longer and more extreme test case to show the true performance of this method. So excuse that the video is casually 20 days in the future from when this blog post was written.

Conclusion

As usual, I tried to put in a Computer Science concept to help explain how to make things a bit more efficient. While the example up above was just one video, there are INP files for basically every video recorded after that point. There are also some from 2021. But they have an error in them from a programming error. Maybe it's possible to recover them. But that'll be for another time.

Sometime in the future, I plan to write a page that will let you see various videos from the DERPG catalogue with this extra detail. There are other extra files that can be utilised. In MW (2019), BOCW, and Vanguard, the CoD API gives you access to match details. And so we can have a minimap that shows the exact location of kills as they happen. It's niche, but it's fun to experiment with this data.

Hope you learned something too.

Settings

Appearance

Theme

Language

Preferred Language

About

"blog.claranguyen.me" details

Introduction

InputRec: An app for recording inputs

A checklist of what I want

Capturing keyboard inputs

Class for recording inputs

File format, visualised

This is open source, btw

Playback

Reading in Binary Data

Event frames

Syncing frames to a YouTube video, naïvely

Syncing frames to a YouTube video, efficiently

See it in action!

Conclusion