Reading files the hard way - Part 3 (ftrace, disk layouts, ext4)
Aug 31, 2019
27 minute read

This article is part of the series Reading files the hard way:

So far, we’ve seen many ways to read a file from different programming languages, we’ve learned about syscalls, how to make those from assembly, then we’ve learned about memory mapping, virtual address spaces, and generally some of the mechanisms in which userland and the kernel interact.

But in our exploration, we’ve always considered the kernel more or less like a “black box”. It’s time to change that.

In the belly of the beast

In part 2, we’ve used gdb to step through our userland program, readfile. This allowed us to inspect the state of CPU registers after each instruction. But we weren’t able to see what was going on in the kernel.

We’ve also used strace to trace system calls. It was less noisy than gdb, because it only displayed information when our userland program was communicating with the kernel.

Now it’s time to step it up and use ftrace. It’s a tracing facility integrated into the Linux kernel. Thanks to it, we’re able to enable tracing of various kernel functions, and events.

Cool bear's hot tip

There’s an excellent Introduction to ftrace over on Julia Evans’s blog.

Speaking of Julia, check out her fantastic zines!

The easiest way for us to get started is to use trace-cmd. On Manjaro/ArchLinux, there’s an AUR package for that, so we can just grab it:

yay -S trace-cmd

The plan is to:

  • Enable tracing for all ext4 events (more on this later)
  • Run our readfile program (the assembly one)
  • Look at the report and try to piece together what’s going on

My first trace wasn’t successful - it only contained events related to zsh (my shell), and git (called by my shell’s prompt to show indicators).

It turns out there are several levels of caching, and /etc/hosts was already cached to the maximum, so reading it didn’t even register.

Luckily, we can clear all these caches (while the system is running!) by running the following as root:

sync
echo 3 > /proc/sys/vm/drop_caches

Then, we can start tracing (again, as root):

trace-cmd record -e ext4

Then launch our program, and ask trace-cmd to format the collected data in a human-readable format:

trace-cmd report trace.dat > trace.txt

And there we have it:

readfile-30843 [001]  1557.395867: ext4_es_lookup_extent_enter: dev 8,1 ino 1311701 lblk 0
readfile-30843 [001]  1557.395869: ext4_es_lookup_extent_exit: dev 8,1 ino 1311701 found 0 [0/0) 0 
readfile-30843 [001]  1557.395872: ext4_ext_map_blocks_enter: dev 8,1 ino 1311701 lblk 0 len 3 flags 
readfile-30843 [001]  1557.395875: ext4_ext_show_extent: dev 8,1 ino 1311701 lblk 0 pblk 5339898 len 3
readfile-30843 [001]  1557.395877: ext4_ext_map_blocks_exit: dev 8,1 ino 1311701 flags  lblk 0 pblk 5339898 len 3 mflags 0x20 ret 3
readfile-30843 [001]  1557.395878: ext4_es_insert_extent: dev 8,1 ino 1311701 es [0/3) mapped 5339898 status W
readfile-30843 [001]  1557.403239: ext4_es_lookup_extent_enter: dev 8,1 ino 1840090 lblk 0
readfile-30843 [001]  1557.403243: ext4_es_lookup_extent_exit: dev 8,1 ino 1840090 found 0 [0/0) 0 
readfile-30843 [001]  1557.403245: ext4_ext_map_blocks_enter: dev 8,1 ino 1840090 lblk 0 len 1 flags 
readfile-30843 [001]  1557.403248: ext4_ext_show_extent: dev 8,1 ino 1840090 lblk 0 pblk 26724819 len 1
readfile-30843 [001]  1557.403251: ext4_ext_map_blocks_exit: dev 8,1 ino 1840090 flags  lblk 0 pblk 26724819 len 1 mflags 0x20 ret 1
readfile-30843 [001]  1557.403252: ext4_es_insert_extent: dev 8,1 ino 1840090 es [0/1) mapped 26724819 status W

The columns appear to be formatted like so:

process_name-pid [cpu_core]  timestamp: event_name: parameters

I was unable to find precise documentation on each of the events, but we can spot that two main things are being done here:

  1. Lookup and map “inode 1311701”
  2. Lookup and map “inode 1840090”

And that’s it.

What did we learn?

ftrace allows tracing behavior inside kernel code. We can’t change it, but we can get a better idea what’s going on.

Where did our paths go?

Whereas strace showed us /etc/hosts, the path of the file we’re mapping, ftrace doesn’t. Instead it shows us inode numbers. To understand why, we must think about the ergonomics of a file system.

✨ Let’s talk about symbolic links ✨

Symbolic links (symlinks for short) are files, but they’re not regular files. They point to another file, and whenever they’re opened, the other file should be opened instead. There’s.. reasons to use symlinks, which I won’t get into, just trust me on that one.

Point is, there’s a bunch of symlinks in almost any Linux system, for example in /usr/lib:

Here, libcairo.so and libcairo.so.2 both point to libcairo.so.2.11703.0. If we run a command like wc (which count words, but also bytes), we see that it’s accessing the file being pointed to in both cases:

However, they’re not copies. They’re just.. symbolic links. As for implementation details, we notice two things: the size of those symlink files is exactly the size of the path they point to, and they have a slightly different mode:

So symbolic links are files, with a special mode flag, whose contents is the path they point to. This means they can be dangling!

Cool bear's hot tip

A dangling symlink is one that points to a non-existent file.

See the example below:

You can make a filesystem out of that (or can you?)

Let’s talk about folders. As we’ve seen, symlinks are just files, with a special mode, and their contents is a path. Couldn’t we do the same for folders?

Let’s make up a filesystem. What’s a filesystem? Just a way to organize file and folders, their contents, and their metadata on disk. You can think of a disk as a big blank slate, like so:

Cool bear's hot tip

Throughout the rest of this article, we’ll be using the word “disk” when what we really mean is “storage device”. It’s shorter!

And let’s say that, just for fun, we divide this disk in a series of blocks. Let’s make them 4096 bytes (or 4KiB). That way, they might be the same size as kernel memory pages. Or a multiple. At least they’re a power of two.

Let’s say in our filesystem, each folder is just a list of names: their children. We can traverse the filesystem down just fine:

  • / has /etc in its list of children
  • /etc has /etc/hosts in its list of children
  • /etc/hosts is a regular file

Sounds good right? I mean, we still have to solve a few problems.

First, we need to decide where on the disk our folders and files live. But that’s not a terribly hard problem. We can just have a list of paths and a list of offsets, maybe like that:

And when we want to access /etc/hosts, well, we just: 1) go through our table, find the right entry, and 2) read it!

This doesn’t seem so bad. Why is it problematic?

Problem 1: Looking for a file is O(n), where n is the number of files.

I’ve just counted the files on one of my Linux systems, and there are 1’128’252 of them. Sounds like looking up a file with a linear search is going to get expensive.

Problem 2: The contents of each file needs to be contiguous.

With our current layout, if a file becomes too big to fit in one block, and the next block is used, then we need to move it:

Problem 3: Metadata is mixed up with data.

Let’s say we want to implement the tree command. It prints the name mode and size of every file in the filesystem, no matter how deeply nested.

With our disk layout, we’d have to read the start of every block that contains a file just to know its metadata. Along with, of course, the contents of every folder (but that’s normal).

This is especially problematic because reading metadata (which happens very often) requires a lot of seeking. Seeking is the process of “jumping to a different location” when reading a storage device.

This is expensive for optical disk readers, or hard disk drives. It is extremely expensive for tape drives.

A DDS Tape drive.

A DDS Tape drive.

Problem 4: Rename operations are extremely expensive.

If we want to rename /etc to /config, we have a bunch of entries to re-write:

This means reading a bunch of entries to find out what we should even rewrite, then write, in many different locations, all across the disk. Remember seeking being expensive? Forget about it.

It gets worse the more subfolders and files a directory has.

Problem 5: Every file operation needs to parse entire paths.

With our scheme, if we want to go from /etc/hosts to /etc, there’s no direct link! We have to parse /etc/hosts into ["etc", "hosts"], remove the last one to get ["etc"] and build the /etc path from that.

Then we have to use our lookup table at the very beginning to find where /etc actually is.

All in all, this is a pretty bad disk layout. It would technically work for a toy operating system, but I wouldn’t want to use it as a daily driver.

What did we learn?

A filesystem’s disk layout must be carefully designed.

Operations like enumerating files, traversing downwards or upwards, renaming folders, should be achievable in as few operations as possible, minimizing seeking.

Enter inodes

One of the lessons we learned from our first attempt at a filesystem disk layout is that we should probably separate metadata and data. We should also find another way to refer to files than by their fully qualified names.

Whether our file lives at /etc/hosts our /config/hosts - it’s the same file! And our scheme should reflect that.

So, let’s start by storing all the metadata next to each other, as early as possible. We need to reserve a big chunk of space, large enough to describe every file on disk.

This will be our index, each element is an “index node”, or “inode”.

Each of those inodes will contain the usual, but also, it’ll contain the number of the block we can find its data in:

As for folders - easy! In the data block, we store a list of directory entries. Each entry refers to another inode, and gives it a name.

Let’s review. Did we actually solve our problems? Firstly, how do we look up a file in this new scheme?

Let’s say we want to open /etc/hosts. First we read the inode for /. For convenience’s sake, let’s decide that, since it’s the filesystem root (topmost) node, it will always be in inode 1:

Then we read its directory entries, look for “etc”. Then we read “etc”’s inode. Then we read its directories entires, look for “hosts”. Finally, we read “hosts”’s data.

I made a quick diagram so it’s easier to follow:

Ok, so it’s not easier to follow.

But there is one upside compared to our previous scheme. Instead of doing a linear search among all the files in the filesystem, by absolute paths, we search across much smaller sets of directory entries, multiple times:

Cool bear's hot tip

The diagram above is not to scale: we just reduced the worst case from 1.1M string comparisons to 228 string comparisons.

The new worst case is 0.02% of the old one.

Of course, this depends how many files there are per directory!

What about rename operations?. You might have noticed in our new scheme, the inodes don’t contain path information - there’s no name field in there.

The names are actually contained in the directory entries:

If we want to rename /etc, we just need to change the directory entry - we don’t need to touch the inodes of any of the children, no matter how deep the subdirectory hierarchy goes:

If we want to move /etc/hosts to /hosts, we simply remove the directory entry in /etc’s inode, and add one to /’s inode.

Again, it’s a flat number of operations, no matter how many directories and files our directories have. This is much better.

Right now, traversing downwards is efficient (more than before), but traversing upwards (going from /etc to /, for example) still requires manipulating paths, and we don’t want to do that.

Here’s an idea: what if we add an entry to each directory, that points to its parent? We could call it “..”. And for the root, we’ll just make it point to itself:

Solving the remaining problems

I don’t know about you, but I’m pretty happy with our disk layout so far.

However, there’s (at least) two things we could be doing better.

Right now, every directory entry is stored in no particular order, in a list. To find a specific entry, we have to do a linear search, ie. do a comparison with each entry one after the other, and stop only when we’ve found what we’ve looking for.

This is only practical if directories never have more than a few thousand files. If we get into the tens of thousands, or hundreds of thousands territory, then things will start to be really slow.

A potential solution is to hash the names of all the entries, and use a Self-balancing binary search tree.

I won’t get into the details here, but basically, it allows insertion (adding an entry), deletion (removing an entry), and searching in logarithmic time, which means it will get expensive much slower than a linear data structure.

Cool bear's hot tip

“Data structures” is a rich and complex field, which would warrant many, many articles.

Let’s no go there today.

The second problem is that our files still have to be contiguous. If they grow, we might have to move them to another set of blocks. We might even find that there is no contiguous set of blocks large enough to hold the file!

To address that, we can use extents.

For each file, we can store a series of (start, length) pairs:

This brings on a new problem: previously, if we wanted to read just the second half of a file, we could simply calculate the address of the first block:

But now, the middle of the file may be in any of the extents. It’s not simple arithmetic anymore. To remedy this, we can also use a tree data structure.

If we want to read the file from the beginning, then we can simply walk the tree in order. If we’re seeking somewhere else, we’ll simply do a search through the tree until we find the right extent, and then so simple arithmetic.

Now for a taste of the real world

Enough playing around! It’s time to look at a real live filesystem.

Before we started thinking about disk layouts, we were tracing kernel events while opening and mapping /etc/hosts.

The trace we got contained lines like:

ext4_ext_map_blocks_enter: dev 8,1 ino 1311701 lblk 0 len 3 flags 
...
ext4_ext_map_blocks_enter: dev 8,1 ino 1840090 lblk 0 len 1 flags 

This makes a little more sense now.

The first line refers to inode “1311701” (we now know what this means). The _ext_ part of the event name probably refers to extents, because there’s a “len 3” (length of 3).

Could it be /etc/hosts?

Cool bear's hot tip

On Linux, we can pass the -i flag to the ls command, to make it display inode numbers.

Nope! /etc/hosts is 1840090. That’s the second line.

What’s the first one then?

Well, remember when I said binaries were mapped when they got executed?

Bingo.

So as far as we can tell, the real file system this Linux system is using is also using inodes. And extents!

We can find out what filesystem it is with the df command (specifically, the -T flag):

So, apparently it’s ext4. It appears the most up-to-date documentation for ext4 is on kernel.org.

Unfortunately, it seems this documentation wasn’t written by the authors of ext4, but rather by someone else reading the code for ext4 in the Linux kernel. Oh well, that’s life.

But first: what’s this /dev/sda1?

Well, remember when I said the kernel was boss? And that it controlled the reality userland processes see? That goes doubly for files.

Not all things that can be open()’d are files in the traditional sense.

Some are just.. resources.

For example, /proc/cpuinfo “contains” information about the installed CPUs:

It’s not actually on any storage device. It’s just a way for the kernel to give userland processes some information, through standard system calls like open, read, write and close.

/proc/self/ is a directory that contains information about the currently running process. If we modify our Go program like so:

package main

import (
	"fmt"
	"io/ioutil"
)

func main() {
	payload, _ := ioutil.ReadFile("/proc/self/cmdline")
	fmt.Printf("%s\n", string(payload))
}

…and run it like so, it’ll print its arguments:

In fact, in The Unix Philosophy, everything is a file.

Cool bear's hot tip

This has made a lot of people angry.

So, /dev/sda1 is simply a “file” that contains the entire contents of our root ext4 partition.

If we wanted the whole disk we’d open /dev/sda instead. But then we’d have to worry about partition tables, and that’s better left for later.

Let’s read a whole partition I guess

Now that we’re done with the introductory material, let’s jump right into it.

We’ll be using rust for this next part.

use std::fs::OpenOptions;
use std::io::Read;
use hex_slice::AsHex;

fn main() -> Result<(), std::io::Error> {

    // open our ext4 partition, READ-ONLY.
    let mut file = OpenOptions::new().read(true).open("/dev/sda1")?;

    // allocate a 128-byte buffer
    let mut buf = vec![0u8; 128];

    // read the first 128 bytes of the file
    file.read_exact(&mut buf)?;

    // print it as hexadecimal
    println!("{:x}", buf.as_hex());

    Ok(())
}

This builds happily with cargo build, but it won’t run:

See, we’re currently running this program as a normal user. And if any user could have access to any partition willy-nilly, then they could bypass file permissions.

And that would be bad.

So let’s pretend we’re root for a hot minute and:

Ah. It’s full of zeroes. We might need that documentation after all.

Here’s what it has to say:

For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes, whichever block that happens to be (usually 0).

Okay then, let’s read starting from byte 1024. We’ll use the positioned-io crate for that.

// instead of std::io::Read
use positioned_io::ReadAt;

fn main() -> Result<(), std::io::Error> {
    // (cut)

    // read the first 128 bytes of the file
    file.read_exact_at(1024, &mut buf)?;

    // (cut)
}

Now we’re cooking! The docs mention a “superblock”, and if we read up on its structure, it says that we should find the magic number 0xEF53 at offset 0x38. It also says it’s a little-endian 16-bit integer.

Well, there’s a crate for that.

Since we’re going to be reading a lot of stuff, let’s make a helper struct. We’ll need a few more use directives:

// this will let us read integers of various width
use byteorder::{LittleEndian, ReadBytesExt};

// a cursor keeps track of our position within any
// resource that implements ReadAt.
use positioned_io::{ReadAt, Cursor};

// 'failure' gives us backtraces when we return
// errors.
use failure::Fallible;

type Result<T> = std::result::Result<T, failure::Error>;

// we'll be able to read from any type that
// implements ReadAt.
struct Reader<IO: ReadAt> {
  inner: IO,
}

impl<IO: ReadAt> Reader<IO> {
    fn new(inner: IO) -> Self {
        Self { inner }
    }

    fn u16(&self, offset: u64) -> Fallible<u16> {
        let mut cursor = Cursor::new_pos(&self.inner, offset);
        Ok(cursor.read_u16::<LittleEndian>()?)
    }
}

Now, any time we want to read something from the partition, we’ll first make a Slice, pass it to a Reader, and use it.

Cool bear's hot tip

Note that neither the File, Slice, or Reader need to be mutable, because none of them maintain position information.

They can even be used concurrently!

// using our new result type:
fn main() -> Result<()> {
    let file = OpenOptions::new().read(true).open("/dev/sda1")?;

    // create a slice that corresponds to the superblock
    let r = Reader::new(Slice::new(file, 1024, None));

    // as per the docs:
    let magic = r.u16(0x38)?;
    println!("magic = {:x}", magic);

    Ok(())
}

Hey, that’s the value the documentation gave us! We’re on the right track.

Block groups

In our toy disk layout, we divided the disk in blocks. ext4 does just the same thing! At offset 0x18 of the super block, we find the size of a block.

Well, we find n, where the block size is 2 ^ (10 + n).

(Note: for this part, we need to add a u32 method to our Reader type. It’s modelled exactly after u16, so I won’t add it here).

What luck! The blocks are 4KB, just like in our toy disk layout.

However, if we look at the docs, we’ll notice the blocks are grouped (into.. block groups).

And those are described in… group descriptors (GDT stands for “group descriptor table”):

The number of blocks per group is stored in the superblock, at offset 0x20:

    // in main
    let bpg = r.u32(0x20)?;
    println!("blocks per group = {}", bpg);

Finally, the number of inodes per group is stored at offset 0x28;

    // in main
    let ipg = r.u32(0x28)?;
    println!("inodes per group = {}", ipg);

The good news is that all those values look legit. The bad news is that we’re nowhere near reading an inode.

Reading an inode in ext4

Let’s recap. An ext4 partition starts with 1024 bytes of padding, then a superblock, which contains various settings like the size of a block, the number of blocks per group, the number of inodes per group, etc.

Then come the block group descriptors. Those contain the offset of a few things, but the one we’re interested in is the inode table:

Let’s start at the beginning - find the inode for /.

There’s a few special inodes in ext4:

The documentation seems about as confident as we are. Those question marks there are.. just great.

Anyway, the inode for the root, /, is always 2. How do we find it? Again, the docs help:

And that makes sense!

Since we’re using a good programming language, let’s start building out a few abstractions.

First, the superblock:

// this is like derive(Debug), but better.
// see https://crates.io/crates/custom_debug_derive
use custom_debug_derive::*;

#[derive(CustomDebug)]
struct Superblock {
    #[debug(format = "{:x}")]
    magic: u16,
    block_size: u64,
    blocks_per_group: u64,
    inodes_per_group: u64,
}

impl Superblock {
    fn new(dev: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(Slice::new(dev, 1024, None));
        // note: we're casting a few fields to `u64` now.
        // this will save us a bunch of grief later.
        Ok(Self {
            magic: r.u16(0x38)?,
            block_size: (2u32.pow(10 + r.u32(0x18)?)) as u64,
            blocks_per_group: r.u32(0x20)? as u64,
            inodes_per_group: r.u32(0x28)? as u64,
        })
    }
}

Let’s take it for a spin:

fn main() -> Result<()> {
    // open our ext4 partition, READ-ONLY.
    let file = OpenOptions::new().read(true).open("/dev/sda1")?;

    let sb = Superblock::new(&file)?;
    println!("{:#?}", sb);

    Ok(())
}

Seems fine!

Now let’s make a type for block group descriptors. We’re going to use it as a container for a constant: the size of a block group descriptor:

struct BlockGroupDescriptor {}

impl BlockGroupDescriptor {
    const SIZE: u64 = 64;
}

Next, let’s make a type for block group numbers. Sure, it’s just a number - but we’ll add a method on it to get a slice of the descriptor.

Cool bear's hot tip

A tuple or struct with a single field is called a newtype.

#[derive(Debug, Clone, Copy)]
struct BlockGroupNumber(u64);

impl BlockGroupNumber {
    fn desc_slice<T>(self, sb: &Superblock, dev: T) -> Slice<T>
    where
        T: ReadAt,
    {
        assert!(sb.block_size != 1024, "1024 block size not supported");
        // the superblock takes up 1 block
        let gdt_start = 1 * sb.block_size;
        let offset = gdt_start + self.0 * BlockGroupDescriptor::SIZE;
        Slice::new(dev, offset, None)
    }
}

Next up, let’s make a type for inode numbers! We’ll add a method that tells us in which block group a particular inode is.

#[derive(Debug, Clone, Copy)]
struct InodeNumber(u64);

impl InodeNumber {
    fn blockgroup_number(self, sb: &Superblock) -> BlockGroupNumber {
        let n = (self.0 - 1) / sb.inodes_per_group;
        BlockGroupNumber(n)
    }
}

Let’s try this all out. We’re looking for inode 2:

fn main() -> Result<()> {
    // open our ext4 partition, READ-ONLY.
    let file = OpenOptions::new().read(true).open("/dev/sda1")?;

    let sb = Superblock::new(&file)?;
    println!("{:#?}", sb);

    let root_bg = InodeNumber(2).blockgroup_number(&sb);
    dbg!(&root_bg);

    let mut buf = vec![0u8; 64];
    root_bg.desc_slice(&sb, &file).read_at(0, &mut buf)?;
    println!("{:x}", buf.as_hex());

    Ok(())
}

As expected, inode 2 is in block group 0, and the descriptor slice, well, is not all zeroes, so that’s a start.

The only thing we care about in the block descriptor is the offset of the inode table. Like most other locations, it’s a block number.

This one’s a bit annoying, because the lower 32-bits are stored at 0x8, and the upper 32-bits are stored at 0x28.

We could do the bit twiddling directly in BlockGroupDescriptor, but let’s add a convenience method to Reader instead:

// in `impl Reader {`
    fn u64_lohi(&self, lo: u64, hi: u64) -> Fallible<u64> {
        Ok(self.u32(lo)? as u64 + (self.u32(hi)? << 32) as u64)
    }

And let’s fill out BlockGroupDescriptor:

#[derive(Debug)]
struct BlockGroupDescriptor {
    inode_table: u64,
}

impl BlockGroupDescriptor {
    const SIZE: u64 = 64;

    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        Ok(Self {
            inode_table: r.u64_lohi(0x8, 0x28)?,
        })
    }
}

Now, we can add a method to directly grab the block group descriptor to BlockGroupNumber:

    // in impl BlockGroupNumber
    fn desc(self, sb: &Superblock, dev: &dyn ReadAt) -> Result<BlockGroupDescriptor> {
        let slice = self.desc_slice(sb, dev);
        BlockGroupDescriptor::new(&slice)
    }

Again, let’s try it:

Hey, 1070 looks like a pretty reasonable number for the inode table of the first block group.

✨ We’re making progress ✨

Now, the inode table is a series of fixed-size inodes.

The first field is a 16-bit (little-endian, still) integer that contains the file mode. A bit later on, we find the inode’s size, again divided between lower 32 bits (at 0x4) and upper 32 bits (at 0x6C).

At 0x28, we find the “block map or extent tree”. It’s a block of 60 bytes. I have a feeling we’re going to need that as well!

Let’s add a convenience method to reader so we can grab that block easily:

    // in impl Reader
    fn vec(&self, offset: u64, len: usize) -> Fallible<Vec<u8>> {
        let mut v = vec![0u8; len];
        self.inner.read_exact_at(offset, &mut v);
        Ok(v)
    }

Next, let’s add a method to grab the slice for a specific inode number.

    // in impl InodeNumber
    fn inode_slice<T>(self, sb: &Superblock, dev: T) -> Result<Slice<T>>
    where
        T: ReadAt,
    {
        let desc = self.blockgroup_number(sb).desc(sb, &dev)?;
        let table_off = desc.inode_table * sb.block_size;
        let idx_in_table = (self.0 - 1) % sb.inodes_per_group;
        let inode_off = table_off + sb.inode_size * idx_in_table;
        Ok(Slice::new(dev, inode_off, Some(sb.inode_size)))
    }

Now, let’s make an Inode type, with all the data we wanted: mode, size, and the 60-byte block. We’ll use Reader::vec which we added recently:

#[derive(CustomDebug)]
struct Inode {
    #[debug(format = "{:o}")]
    mode: u16,
    size: u64,

    #[debug(skip)]
    block: Vec<u8>,
}

impl Inode {
    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        Ok(Self {
            mode: r.u16(0x0)?,
            size: r.u64_lohi(0x4, 0x6C)?,
            block: r.vec(0x28, 60)?,
        })
    }
}

And finally, let’s add a method to InodeNumber that directly reads the inode:

    // in impl InodeNumber
    fn inode(self, sb: &Superblock, dev: &dyn ReadAt) -> Result<Inode> {
        let slice = self.inode_slice(sb, dev)?;
        Inode::new(&slice)
    }

We’re ready. Let’s try it:

fn main() -> Result<()> {
    // open our ext4 partition, READ-ONLY.
    let file = OpenOptions::new().read(true).open("/dev/sda1")?;

    let sb = Superblock::new(&file)?;
    println!("{:#?}", sb);

    let root_inode = InodeNumber(2).inode(&sb, &file)?;
    println!("{:#?}", root_inode);

    Ok(())
}

Hurray! The values match what the stat command reports. I’m sure there’s lots of directories with the same permissions and size, so we could be reading the wrong inode, but at least we’re probably not reading, say, a random data block instead.

Let’s quickly add a Filetype enum so we can make sure it’s a directory. The file type is contained in the mode:

We’ll use the num_enum crate here:

use num_enum::*;
use std::convert::TryFrom;

#[derive(Debug, TryFromPrimitive)]
#[repr(u16)]
enum Filetype {
    Fifo = 0x1000,
    CharacterDevice = 0x2000,
    Directory = 0x4000,
    BlockDevice = 0x6000,
    Regular = 0x8000,
    SymbolicLink = 0xA000,
    Socket = 0xC000,
}

And add a getter to Inode:

    // in impl Inode
    fn filetype(&self) -> Filetype {
        Filetype::try_from(self.mode & 0xF000).unwrap()
    }
Cool bear's hot tip

To keep the example code concise, we use unwrap() here.

If we somehow read an inode from the wrong slice, or if the inode has an invalid mode, our program will panic.

This is not desirable in a real filesystem implementation, but here it works as a nice sanity check.

Let’s give it a shot:

    // in main
    let root_inode = InodeNumber(2).inode(&sb, &file)?;
    println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

Everything looks fine!

Now that we’re sure we’re dealing with a directory, we can start reading its entries.

But first - remember extents? Those are in the block we just read!

The kernel documentation is here to help.

#[derive(Debug)]
struct ExtentHeader {
    entries: u64,
    depth: u64,
}

impl ExtentHeader {
    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        let magic = r.u16(0x0)?;
        assert_eq!(magic, 0xF30A);

        Ok(Self {
            entries: r.u16(0x2)? as u64,
            depth: r.u16(0x6)? as u64,
        })
    }
}

Let’s read one.

    // in main
    let ext_header = ExtentHeader::new(&Slice::new(&root_inode.block, 0, Some(12)))?;
    println!("{:#?}", ext_header);

Well, it’s not much but - again, the values make sense. It’s followed by a single entry, so we know the data for / is stored in a single range of contiguous blocks. It has a depth of 0, which means it’s a leaf node.

Cool bear's hot tip

In our case, we’ll assume that /, /etc, and /etc/hosts all only have one extent node.

This makes sense, since there’s only 24 children for /, only 204 children for /etc, and /etc/hosts is shorter (511) than a block (4096).

However, we’ll verify our assumptions with assert!.

#[derive(Debug)]
struct Extent {
    len: u64,
    start: u64,
}

impl Extent {
    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        Ok(Self {
            len: r.u16(0x4)? as u64,
            // the block number the extent points to is split
            // between upper 16-bits and lower 32-bits.
            start: ((r.u16(0x6)? as u64) << 32) + r.u32(0x8)? as u64,
        })
    }
}

Our complete extent reading code becomes:

    // in main, after finding root_inode
    let ext_header = ExtentHeader::new(&Slice::new(&root_inode.block, 0, Some(12)))?;
    println!("{:#?}", ext_header);

    assert_eq!(ext_header.depth, 0);
    assert_eq!(ext_header.entries, 1);

    let ext = Extent::new(&Slice::new(&root_inode.block, 12, Some(12)))?;
    println!("{:#?}", ext);

Hey, those are still reasonable values!

According to this, we can find the data for / at block 9262.

Let’s take a moment to add a convenience method to the Inode type:

    fn data<T>(&self, sb: &Superblock, dev: T) -> Result<Slice<T>>
    where
        T: ReadAt,
    {
        let ext_header = ExtentHeader::new(&Slice::new(&self.block, 0, Some(12)))?;
        assert_eq!(ext_header.depth, 0);
        assert_eq!(ext_header.entries, 1);

        let ext = Extent::new(&Slice::new(&self.block, 12, Some(12)))?;
        assert_eq!(ext.len, 1);

        let offset = ext.start * sb.block_size;
        let len = ext.len * sb.block_size;
        Ok(Slice::new(dev, offset, Some(len)))
    }

And then dump the first 128 bytes of /’s data, as if it were a string:

    // in main
    let root_inode = InodeNumber(2).inode(&sb, &file)?;
    println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

    let root_data = root_inode.data(&sb, &file)?;
    let data_start = Reader::new(&root_data).vec(0, 128)?;
    println!("{}", String::from_utf8_lossy(&data_start));

Now, I don’t know about you, but this looks very exciting to me.

We can see some of the top-level directory names in there: proc, dev, bin - and even . and ..! (the current directory and parent directory, respectively).

Let’s read them proper: the directory entry format is actually pretty straightforward.

#[derive(CustomDebug)]
struct DirectoryEntry {
    #[debug(skip)]
    len: u64,
    inode: InodeNumber,
    name: String,
}

impl DirectoryEntry {
    fn new(slice: &dyn ReadAt) -> Result<Self> {
        let r = Reader::new(slice);
        let name_len = r.u8(0x6)? as usize;
        Ok(Self {
            inode: InodeNumber(r.u32(0x0)? as u64),
            len: r.u16(0x4)? as u64,
            name: String::from_utf8_lossy(&r.vec(0x8, name_len)?).into(),
        })
    }
}

We’ll simply add a method to Inode that returns a vec of DirectoryEntry. This won’t fit all usecases, but it’ll fit ours!

    // in impl Inode
    fn dir_entries(&self, sb: &Superblock, dev: &dyn ReadAt) -> Result<Vec<DirectoryEntry>> {
        let data = self.data(sb, dev)?;

        let mut entries = Vec::new();
        let mut offset: u64 = 0;
        loop {
            let entry = DirectoryEntry::new(&Slice::new(&data, offset, None))?;
            if entry.inode.0 == 0 {
                break;
            }
            offset += entry.len;
            entries.push(entry);
        }
        Ok(entries)
    }

Now for the moment of truth:

    // in main:
    let root_inode = InodeNumber(2).inode(&sb, &file)?;
    println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

    let root_entries = root_inode.dir_entries(&sb, &file)?;
    println!("{:#?}", root_entries);

🎉🎉🎉

Everything appears to be in order.

Let’s add a convenience method to find a particular child:

    // in impl Inode
    fn child(&self, name: &str, sb: &Superblock, dev: &dyn ReadAt) -> Result<Option<InodeNumber>> {
        let entries = self.dir_entries(sb, dev)?;
        Ok(entries
            .into_iter()
            .filter(|x| x.name == name)
            .map(|x| x.inode)
            .next())
    }
Cool bear's hot tip

Why is the return type Result<Option<T>>?

Well, reading from the partition might fail - and that would return Err(e).

Reading from the partition might succeed, but there might be no child with this name. That would return Ok(None).

Finally, reading might succeed, and we might find such a child, and that would return Ok(Some(n)).

And let’s use it:

    // in main
    let root_inode = InodeNumber(2).inode(&sb, &file)?;
    println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

    let etc_inode = root_inode.child("etc", &sb, &file)?;
    println!("{:#?}", etc_inode);

Everything checks out.

Let’s go one step further and read that inode too:

    // in main
    let root_inode = InodeNumber(2).inode(&sb, &file)?;
    println!("({:?}) {:#?}", root_inode.filetype(), root_inode);

    let etc_inode = root_inode
        .child("etc", &sb, &file)?
        .expect("/etc should exist")
        .inode(&sb, &file)?;
    println!("({:?}) {:#?}", etc_inode.filetype(), etc_inode);

Still good. Can we find hosts in there?

    // in main
    let hosts_inode = etc_inode
        .child("hosts", &sb, &file)?
        .expect("/etc/hosts should exist")
        .inode(&sb, &file)?;
    println!("({:?}) {:#?}", hosts_inode.filetype(), hosts_inode);

Excitement intensifies

And can we read it as a string??

    // in main
    let hosts_data = hosts_inode.data(&sb, &file)?;
    let hosts_data = Reader::new(&hosts_data).vec(0, hosts_inode.size as usize)?;
    let hosts_data = String::from_utf8_lossy(&hosts_data);
    println!("---------------------------------------------");
    println!("{}", hosts_data);

Cool bear's hot tip

You can find the complete code listing on GitHub Gist.

It clocks in at exactly 300 lines of Rust! (Counting empty lines and punctuation lines).

This article is part of the series Reading files the hard way:

This article was made possible thanks to my patrons: Aurora, Jesús Higueras, Ryszard Sommefeldt, Jérémy Gtld, Someone, Xananax, Nicolas Goy and Jane Lusby.

If you liked this article, please support my work on Patreon!

Become a Patron