Showing keyboard inputs with a video
Sunday, April 28, 2024

Introduction

For the past few years, I've had something in mind for my library of gameplay footage: why not give viewers the option to see the inputs visually? Not only am I sometimes curious about my own inputs (especially in a rhythm game or FPS), but it can also be quite satisfying to see keys light up alongside the video. It's something extra. But it's cool.

It does have some practical use though. When I was teaching online due to COVID, I did something like this to help students see what I was doing in VIM and TMUX. Here's what I mean:

That's pretty cool, huh? But it only showed the keyboard; it didn't actually record button presses. It was simply OBS capturing a keyboard input display application and compositing it into the stream.

With that, let's get started. There are two parts to this. First, we need to write something that actually records inputs, in C++. Then we need some fancy JavaScript to read the file in on a webpage and display it. Since the code is portable, I can literally embed it in this blog post to show toward the end. Stay tuned. There are also Computer Science techniques that can significantly speed this up. Specifically, C++'s lower_bound.

InputRec: An app for recording inputs

For this project to work, we need real data. So we need an application specifically made for recording keyboard presses. Since I want full control, it will mean a completely custom format so I can get the exact data that I want and need. We won't be following conventions here.

A checklist of what I want

Since we are targeting Windows, this involves the Windows API, which I am not very familiar with. To be honest, I actually wrote the first iteration of this software on 2021/02/02. I had a few things in mind for it. Here's a checklist:

Capturing keyboard inputs

First, we need a way to capture keystrokes, whether they are pressed or released. In something like Game Maker, we had functions like keyboard_check_direct, which bypasses window focus and checks at a hardware level. In C++, we don't have that luxury just yet. We have to write it ourselves. For that, a low-level hook is necessary.

C++ Code
MSG msg;
HHOOK hook_kb = SetWindowsHookEx(WH_KEYBOARD_LL, LowLevelKeyboardProc, 0, 0);

// Keep running until told to stop (GetMessage returns 0 on WM_QUIT)
while (GetMessage(&msg, NULL, 0, 0) > 0) {
	TranslateMessage(&msg);
	DispatchMessage(&msg);
}

UnhookWindowsHookEx(hook_kb);

We need to keep the application open; that's why the bottom portion is in a loop. It continuously pumps messages, which keeps the thread alive so the hook can fire. At a system level, this hooks on and monitors key events. Whenever one happens, Windows calls LowLevelKeyboardProc, which is a function we get to define. Isn't that nice?

C++ Code
LRESULT CALLBACK LowLevelKeyboardProc(int nCode, WPARAM wParam, LPARAM lParam) {
	PKBDLLHOOKSTRUCT p;

	if (nCode == HC_ACTION) {
		p = (PKBDLLHOOKSTRUCT) lParam;

		switch (wParam) {
			case WM_KEYDOWN:
			case WM_SYSKEYDOWN: // Also catch keys pressed while Alt is held
				// A key was pressed. Check p->vkCode
				break;

			case WM_KEYUP:
			case WM_SYSKEYUP:
				// A key was released. Check p->vkCode
				break;
		}
	}

	return CallNextHookEx(NULL, nCode, wParam, lParam);
}

That's pretty nifty. Now we just have to handle the record button, F9. I do this by having a #define REC_BTTN VK_F9 at the top. Then, when a key is pressed, simply check if it's that button and start/stop recording. But we need something to handle the recording and storing of the inputs.
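
As a rough sketch, the toggle inside the hook could look like this. The "recording" flag and its handling are my illustration; REC_BTTN, rec_kb, and the record_* calls come from the recorder class introduced in the next section.

C++ Code
#define REC_BTTN VK_F9

// Sketch: inside LowLevelKeyboardProc's WM_KEYDOWN case
static bool recording = false;

if (p->vkCode == REC_BTTN) {
	if (!recording)
		rec_kb.record_start();
	else {
		rec_kb.record_stop();
		rec_kb.record_save("recording.inp");
	}

	recording = !recording;
}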

Now we can get back to pure C++, and not any Windows-specific stuff.

Class for recording inputs

Now we need a class that is capable of storing the recorded key presses and then exporting them to a file. The format is actually very simple. We will utilise chrono::high_resolution_clock along with Windows hooks and a deque data structure to hold all key-press events and timestamps. A small look into recorder.hpp reveals the following class prototype:

C++ Code
template <typename T>
class recorder {
	public:
		recorder();

		void record_start();
		void record_stop();

		void record_capture(const T &);
		void record_save(const string & = "recording.inp");

	private:
		struct ev_pair {
			ev_pair();
			ev_pair(const T &);
			ev_pair(const chrono::high_resolution_clock::time_point &,
				const T &);

			chrono::high_resolution_clock::time_point ts;
			T val;
		};

		chrono::system_clock::time_point sys_start;
		chrono::high_resolution_clock::time_point start;
		chrono::high_resolution_clock::time_point end;

		deque<ev_pair> events;
};

It's very simple. Every function's name makes it obvious what it does. Event pairs (ev_pair) store a timestamp along with the key pressed. The class also holds onto the data so it can be exported at a later time, in case I want to use this in another application for some reason... That's also why it's templated. For our case, recorder<uint16_t> is perfect.

record_capture is basically a glorified push_back. It's what adds an event to the recording: it takes the key pressed, figures out the timestamp, and stores both. This function is called the moment LowLevelKeyboardProc fires and captures a key press or release.
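
Internally, it presumably looks something like this. This is my guess at the body; the real implementation is in the GitHub repository linked later.

C++ Code
template <typename T>
void recorder<T>::record_capture(const T &val) {
	// Pair the value with the current high-resolution timestamp
	events.push_back(ev_pair(chrono::high_resolution_clock::now(), val));
}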

When a key is pressed down, the value stored is the vkCode OR'd with 0x100. In other words:

C++ Code
rec_kb.record_capture(0x100 | p->vkCode);

Otherwise, it simply stores the key ID from vkCode. This is how we can tell a button press from a release. And it's all in a single integer.
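
Decoding goes the other way. A minimal sketch (the variable names here are mine):

C++ Code
// Bit 8 = press/release flag, low 8 bits = virtual key code
bool     pressed = (val & 0x100) != 0;
uint16_t key_id  = val & 0xFF;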

File format, visualised

The file format may be completely custom. But that also means that there's a specification for it. Thus, behold the following first 160 bytes of an INP file:

Every INP file starts with a magic number: 1128616521, or 0x43455249 in hex, or IREC as a string. This is at the very beginning of the file. After that comes a 32-bit integer for the version, then a 64-bit integer for the timestamp of when the recording started. This timestamp is in nanoseconds, so you will need to divide it by 1,000,000,000 (1 billion) to get a proper UNIX timestamp. After the start timestamp, there is a duration timestamp in the same format. Then there's a 64-bit unsigned integer for the number of events (key presses and releases). After that, every 10 bytes is an 8-2 pair of timestamp and button input. The former requires division by a billion; the latter is an unsigned 16-bit integer.

To put it like a table:

Position          Name               Data Type  Notes
0x00              Magic              char[4]    Always IREC
0x04              Version            uint32_t
0x08              Start Timestamp    uint64_t   Divide by billion
0x10              Duration           uint64_t   Divide by billion
0x18              Number of Events   uint64_t
0x20 + (N * 0xA)  Event N timestamp  uint64_t   Divide by billion
0x28 + (N * 0xA)  Event N key ID     uint16_t   Key ID is the low 8 bits. Press is the 9th bit.

The division by a billion is done after the number is cast to a double. If you want the nanoseconds as a whole number instead, you can just modulo the number by a billion. To show what a timestamp contains, take the example bytes 24 00 9D AB 83 BF C9 17. Read as a little-endian integer, this decodes to 1714111705431998500. Divide that by 1 billion, throw it into GNU date, and you get the following ISO 8601 compliant timestamp:

Bash Command
UNIX> date -d "@1714111705.431998500" '+%Y-%m-%dT%H:%M:%S.%N%:z'

2024-04-26T02:08:25.431998500-04:00

Every single video in the DERPG archive from 2014 onward has a timestamp like this. You can see an example of this in "Grind Series: Quantity without compromising Quality". The point is that we can take the timestamp of this file along with the one stored in the MKV master, and compute the delay between the two to guarantee that the buttons sync up automatically. Just because the user presses F9 and both my application and, say, GeForce Experience catch it doesn't mean they both start recording at the same time. So this is necessary.

Finally, just to hit it with a digital sledgehammer, here's the format visualised with a figure:

The C-like struct syntax to the right has the definition order flipped to make it easier to read. Anyway, get the idea? Good. Because we will need this for the JavaScript implementation. Binary is fun.
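
For good measure, here's that layout written out as C structs. This is my reconstruction from the table above, not the original figure; the packing pragma is needed so each event stays exactly 10 bytes instead of being padded.

C++ Code
#include <cstdint>

#pragma pack(push, 1)
struct inp_event {
	uint64_t ts;          // Nanoseconds; divide by 1e9 for seconds
	uint16_t val;         // Low 8 bits = key ID, bit 8 = pressed
};

struct inp_header {
	char     magic[4];    // Always "IREC"
	uint32_t version;
	uint64_t start_ts;    // Nanoseconds; divide by 1e9 for UNIX time
	uint64_t duration;    // Nanoseconds
	uint64_t num_events;  // Count of inp_event records that follow
};
#pragma pack(pop)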

This is open source, btw

With that, we have a program that is able to record keystrokes with nanosecond precision just by pressing F9. If you want the code or executable, this is all open source on my GitHub. Simply go to https://github.com/iDestyKK/InputRec.

Playback

Now that we have a format that gives us the exact data we need, it's time to throw it on the screen and actually show it alongside a video. For that, we need some JavaScript to read in the binary file and parse it. So, I introduce inp.js.

Reading in Binary Data

To get the file and read it in JavaScript, we have to resort to some weird stuff, I guess. Basically, we make an XMLHttpRequest and override the MIME type to be plain text, which forces the web browser to not parse it. Instead, it passes the bytes through unmodified. This means we can grab an array of the file's bytes via:

JavaScript
function inp_read(url) {
	// Create a request
	var request = new XMLHttpRequest();
	request.overrideMimeType('text/plain; charset=x-user-defined');
	request.open(
		'GET',
		url,
		false   // Synchronous request; bytes are ready once send() returns
	);
	request.send();

	// Get the binary from the file, please
	let buffer = new Uint8Array(request.response.length);

	// Save the day
	for (let i = 0, len = request.response.length; i < len; i++)
		buffer[i] = request.response.charCodeAt(i);

	// Have a nice day
	return buffer;
}

Once that's done, inp.js has some helper functions at the top which give us the data from the header described above. That's why I wanted to write it all out and document it. The following functions exist in inp.js:

They are all fairly straightforward and obvious. Conveniently, inp_read_double performs the division by a billion for me. So we are able to read in the header and all of the data pretty easily. The binary of the INP file is tossed into a function called inp_import to do this for us. It returns an associative array that has all events, the duration, etc. In fact, the loading procedure with this library is:

JavaScript
let inp_root = inp_read("file.inp");
let inp      = inp_import(inp_root);
let list     = inp_inputs_generate_list(inp);
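
To give an idea of what those helpers do internally, here's a sketch of reading a 64-bit nanosecond value as a double. This is my guess at inp_read_double's internals; the real inp.js may differ.

JavaScript
function inp_read_u32(buf, pos) {
	// Little-endian 32-bit read from a Uint8Array
	return (buf[pos] | (buf[pos + 1] << 8) | (buf[pos + 2] << 16)
		| (buf[pos + 3] << 24)) >>> 0;
}

function inp_read_double(buf, pos) {
	// 64-bit integers overflow JS numbers, so combine two 32-bit halves.
	// Doubles round here, but the error is well under a microsecond.
	let lo = inp_read_u32(buf, pos);
	let hi = inp_read_u32(buf, pos + 4);

	// Division by a billion turns nanoseconds into seconds
	return (hi * 4294967296 + lo) / 1000000000;
}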

But, loading in the events isn't enough to show all of the button presses. We have the timestamps and whether a button is pressed. But we still need to have a convenient way to turn a timestamp from a YouTube video into a "this is what buttons are pressed down at this moment".

Event frames

The inp_inputs_generate_list function generates what I'd like to call "Button Pressing Frames". Basically, it is an array of all keys pressed at a given moment. Assume there are 256 possible keys (which is... massively overkill). The moment a key is pressed or released, a new "frame" is generated and appended to the array, updating which keys are held. Each frame is timestamped too.

For instance, if I press "A", a frame is created that shows "A" being held down. Then if I press "W" while still holding down "A", it generates another frame which shows both buttons being held down. It only does this upon a key being pressed or released. So it's light on memory usage.
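
As an illustration of the idea (not the actual inp_inputs_generate_list), building such a list from the raw events could look like this:

JavaScript
// Each frame = [timestamp, Uint8Array(256) of key states]
function frames_from_events(events) {
	let held   = new Uint8Array(256);
	let frames = [];

	for (let [ts, val] of events) {
		held[val & 0xFF] = (val & 0x100) ? 1 : 0;

		// Snapshot the state so later events don't mutate old frames
		frames.push([ts, held.slice()]);
	}

	return frames;
}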

To extend on this even further, here's 8 "frames". Notice the timestamps and buttons pressed.

8 frames for the key states of nearly 4.5 seconds is pretty nice. Once a frame is due to be drawn, we do so by modifying the HTML of the page. Each key ID is paired to a DIV on the webpage, and we simply change the CSS class of that DIV to indicate whether it's held down or not. Doing this manually by throwing a frame at the page is really simple.
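
Something along these lines (the DIV naming scheme and class names here are made up for illustration):

JavaScript
function frame_apply(frame) {
	let [ts, held] = frame;

	for (let key = 0; key < 256; key++) {
		let div = document.getElementById("key_" + key);

		// Flip the CSS class to style the key as held or released
		if (div)
			div.className = held[key] ? "key held" : "key";
	}
}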

The other advantage to doing it this way is that it handles the viewer jumping to a random position in the video. It doesn't rely on a previous state; it just stores what should be pressed at a given point in time. So if someone jumps 4 minutes ahead or back in a video, we can just set the DIVs appropriately. But... how can we grab each "frame" efficiently?

Syncing frames to a YouTube video, naïvely

Before we actually get to this, when you embed a YouTube video on a page (like in the demo below), you may get the timestamp of it via either:

JavaScript
// This page
player.getCurrentTime();

// If on YouTube directly...
document.getElementById("movie_player").getCurrentTime();

This timestamp is in seconds, with microsecond accuracy. More than you'll ever need for a video. That's perfect. Our button "frames" are also in seconds with that same accuracy. So if we just seek to a point in that array where the timestamp matches, we will know exactly what buttons are pressed at any moment.

Technically, the correct answer is easy. Anyone can do a naïve and inefficient loop like this:

JavaScript
let list = inp_inputs_generate_list(inp);
let pos = player.getCurrentTime();
let i, sz = list.length;

// Go through all frames until the one just before "pos", if possible
for (i = 1; i < sz; i++) {
	if (list[i][0] > pos) {
		/* Update all DIVs to match frame list[i - 1] */

		/* End loop early */
		break;
	}
}

While it would work, it's also very slow. It would become progressively slower as the video goes on and on. The demo I will feature down below has 5841 "frames". Do we really want to go through a loop that many times, at 60 frames a second? I don't think so... In Computer Science, we call this a linear time operation, or O(N). We can do better. Much better.

Syncing frames to a YouTube video, efficiently

First off, notice that the frames are conveniently sorted for us already. We can take advantage of this by combining two concepts: Binary Search and Upper/Lower Bound. Binary search basically means to start in the middle of a sorted set and go either way depending on whether the data is at the lower or upper end. So if you had 1000 elements and were searching for 625, start at 500. 625 is greater than 500. So search 500 - 1000. 750 is the midpoint. 625 is less than 750. So search 500 - 750. You get 625. Instead of going 625 times in a loop, you do 3 iterations. This is known as logarithmic time, or O(lg N), where lg N = log base 2 of N.

The thing is, we can't just rely on Binary Search alone, because we are searching for a number that almost certainly isn't in the frame array. This is where a concept similar to upper/lower bound comes into play. If the video is at 4.3 seconds and button frames exist at 4.2 and 4.49 seconds, it needs to select the 4.2 second one. It must select the one that is just under it. In C++ terms, it's the upper bound - 1 in all but one case.
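
Since the intro name-dropped C++'s lower_bound, here's what "upper bound - 1" looks like over there. This is a sketch with made-up names, assuming a sorted vector of frame timestamps.

C++ Code
#include <algorithm>
#include <vector>

// Find the frame at or just before "pos" using std::upper_bound
size_t frame_at(const std::vector<double> &ts, double pos) {
	auto it = std::upper_bound(ts.begin(), ts.end(), pos);

	// Everything is after "pos"; clamp to the first frame
	if (it == ts.begin())
		return 0;

	// upper_bound - 1 = last frame with timestamp <= pos
	return (it - ts.begin()) - 1;
}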

Let's treat the list like a BST (Binary Search Tree). Do note, it's still a list or array internally. But we can simulate the root node by making the midway point the root. Then it breaks into 2 and so on. It's better to show it visually. Thus,

In this array of 15 entries, the 8th entry (frame[7]) is the root node. Notice that all nodes point down to a "key frame" with a dotted line. At the bottom, frame IDs are notated along with timestamps. Now, all we have to do is go down the tree from the top (either left or right).

Let's say the video is at 4.2 seconds. If so, it's on frame[6], as that's right at 4.2 seconds. That is very basic BST traversal. No keys are held down there in the figure above. Here it is explained visually:

That's 3 iterations. But also, that's the maximum number of iterations possible with this setup. It's possible it finds the frame in fewer. It's guaranteed to be this many at most, as lg(15) is about 3.907. If we have 16 frames, then 4 is possible. But this function grows very slowly. The 5841 frames I was talking about earlier? lg(5841) is about 12.512. I'll take doing 12 iterations 60 times a second over doing 5000+ iterations 60 times a second.

But we aren't done just yet. How do we account for the timestamp landing between two frames? The figure above only covers matching a frame exactly. Let's say we are looking for 4.3 seconds. We need a range rather than an exact number. The trick is to keep track of the minimum and maximum values as you traverse the tree. When we start at frame[7], both the minimum and maximum are 4.49. But 4.3 is less than that, so keep 4.49 as the maximum and traverse left to 2.51. That becomes the new minimum. Go right, and keep updating the minimum. It eventually becomes 4.2, while the maximum stays 4.49. The general algorithm is that when you go left, the maximum updates; when you go right, the minimum updates.

Once the "distance" (in the array) between the minimum and maximum nodes is 1, terminate the binary search and choose the minimum. The answer is guaranteed to always be true except in extreme cases like an exact match or out of bounds. Visually, we show the traversal as follows:

Contrary to Computer Science theory and fancy diagrams, the implementation is very simple. Assume a function called inp_get_state_from_ts, which will take a timestamp and return the correct "frame" via the logarithmic method shown up above. Thus:

JavaScript
function inp_get_state_from_ts(state_array, target) {
	let min, max, i, len, ts;

	len = state_array.length;
	min = 0;
	max = len;
	i = Math.floor(min + ((max - min) / 2));

	// Outlier cases
	if (target > state_array[len - 1][0])
		return state_array[len - 1];

	if (target < state_array[0][0])
		return state_array[0];

	// Binary Search
	while (true) {
		ts = state_array[i][0];

		// Extreme cases
		if (ts == target)
			return state_array[i];

		if (max - min == 1)
			return state_array[min];

		// Bound reconfiguring
		if (target > ts) {
			min = i;
			i = Math.floor(min + ((max - min) / 2));
		}
		else
		if (target < ts) {
			max = i;
			i = Math.floor(min + ((max - min) / 2));
		}
	}
}

Basically your introductory Data Structures and Algorithms course stuff, slightly modified. To avoid having to backtrack, I keep track of a minimum and maximum index, which update until the distance between the two is 1. Then I choose the minimum of the two. If we are lucky and get a perfect match somehow, we pick that instead. Lastly, if the timestamp is out of bounds (less than the first, or more than the last), it returns the first or last element in the array/tree respectively.

The distance between the minimum and maximum being 1 is guaranteed to happen, with the only exceptions being extreme bounds and an exact match. It's how we account for a timestamp falling between two nodes, as shown in the visual above. This playing around with the bounds is why I was careful to say it's similar to upper/lower bound. I suppose you could implement those with this mentality, and it would be very close. The only difference is that the comparison target < ts would mean go left, and you'd go right otherwise (upper bound); or change it to target <= ts (lower bound).
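
For comparison, a classic lower bound over the same frame array could look like this (a sketch, not part of inp.js):

JavaScript
function inp_lower_bound(state_array, target) {
	let min = 0;
	let max = state_array.length;

	// Halve the range until it collapses
	while (min < max) {
		let mid = Math.floor((min + max) / 2);

		if (state_array[mid][0] < target)
			min = mid + 1;
		else
			max = mid;
	}

	// First index whose timestamp is >= target
	return min;
}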

For the record, that explanation of BSTs was based on my time as a graduate teaching assistant. I actually had to teach this stuff. You can check out my notes here if you want to learn more about BSTs and other topics.

See it in action!

Of course I need to actually show this off! It also gives me a place to dump example visuals that I can also go back to in the future. So, here's a YouTube video embed along with a keyboard on the side. Go on and try watching it on the page. The keyboard will always be in sync with the video regardless of what you do. Even on a frame-by-frame basis.

[Interactive demo: an embedded YouTube video with an on-screen keyboard (Esc through RCTRL) that lights up in sync with the recorded inputs.]
Well... it certainly made me realise I re-press keys a lot more than I should. Also, this was updated to a longer and more extreme test case to show the true performance of this method. So excuse that the video is casually 20 days in the future from when this blog post was written.

Conclusion

As usual, I tried to work in a Computer Science concept to help explain how to make things a bit more efficient. While the example up above was just one video, there are INP files for basically every video recorded after that point. There are also some from 2021, but they have corrupted data in them from a programming bug. Maybe it's possible to recover them. But that'll be for another time.

Sometime in the future, I plan to write a page that lets you see various videos from the DERPG catalogue with this extra detail. There are other extra files that can be utilised. In MW (2019), BOCW, and Vanguard, the CoD API gives access to match details, so we could have a minimap that shows the exact location of kills as they happen. It's niche, but it's fun to experiment with this data.

Hope you learned something too.



