Me

The simple thoughts of Jonathan J Hunt

Securely Serving From Github Pages Using Cloudflare (Free)

Do you see a little green icon in your browser bar? This website is now served using TLS (a secure connection also known as SSL) thanks to Cloudflare’s new free Universal SSL1. Combined with Github Pages hosting this website securely costs me nothing. Here I explain how to connect these services together so you can get your own green icon. I’ll assume you’re already familiar with Github Pages and how to publish your site using it.

As a bonus, in addition to making the web slightly better for everyone2, your site will probably resolve and load faster (Cloudflare caches your site at edge locations around the world and supports a new faster protocol called SPDY) and comes with IPv6 support.

On the other hand, be aware this free solution has some limitations and is not appropriate for if your site deals with sensitive or personal information. In particular:

  • If you use your own custom domain (like I do) the connection between Cloudflare servers and Github is not secured at all. This connection is generally going to be over high-speed backhaul connections that are much harder for an attacker to get at (compared with the wifi link in the coffee shop) but its certainly not safe against state level or sophisticated attackers. This limitation is because Github does not currently support secure connections for custom domains.

  • Older browsers will not be able to access your site securely (and if you force all connections to be secure, as I do, they will not be able to access your site at all). Primarily, this affects users of Internet Explor on Windows XP and old Android phones. If this is an issue for you can you use a paid plan to support them (essentially, the only way for Cloudflare to support old browsers is to provide a dedicated IP address for your site which they can’t economically provide to everyone for free - see their blog post for more details).

  • Cloudflare doesn’t currently support adding HSTS headers (telling the user’s browser to always use a secure connection for your website) so, again, this setup is not appropriate for a website dealing with sensitive information.

Setup

I was pleasantly surprised how straightforward it was to set this. I’ve outlined the steps below:

  • Setup an account at Cloudflare

  • Click through and follow the steps to setup Cloudflare for your domain. The defaults are pretty much ok.

  • As instructed you will need to point your DNS nameservers to Cloudflare. You then use Cloudflare to manage DNS for your domain. For Github Pages you should have a CNAME entry pointing to your github pages (e.g. for this site I have a CNAME @ aliased to jjh42.github.io). Make sure any other DNS settings are correct (i.e. for subdomains or MX fields).

  • (Optionally) I lowered the security profile since I’m not a likely target for attacks and I didn’t want any users (for example on Tor) to be forced to solve CAPTCHAs. This is Settings -> Security Profile. You may also want to enable CDN + Basic Optimizations which may help your website load a bit faster.

  • Enable SSL by selecting Flexible SSL in settings (if you are not using a custom domain then I believe you should be able to choose Full (Strict) for a more secure setup but I have not tried this). You may have to be patient at this point as you wait up to 24 hours for DNS settings to propogate and for Cloudflare to generate an SSL certificate for your domain.

  • Now your site should be available via Cloudflare in both secure and insecure versions. Try visiting https://yourwebsite (note the s) and check that it works.

  • Check in Google Chrome (I’m not sure about other browsers) for mixed content errors. Chrome will refuse to load resources on your page that have insecure URLs (since this could be easily used to compromise the integrity of the page) and will show the user a yellow warning icon. Usually, this can be easily fixed by replacing the resource with a secure or protocol agnostic form. For example, I used MathJAX which referred to a resource at http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML which Chrome wasn’t happy with. By editing the URL to be //cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML everythings works fine and the browser will connect securely when on a secure page.

  • Setup a redirect so that visitors are always sent to the secure version of your website. From the Cloudflare console choose Page Rules and add a URL pattern http://yourwebsite/* and select Always use HTTPS and then Add Rule. Now check that when you go to the insecure version of your website you are redirected to the secure version.

  • Relax, knowing that you’ve made the web a slightly better place.

I hope that’s helpful. Feel free to ask questions or provide additional information in the comments below (I can’t promise I’ll have time to respond promptly though).

Be aware that due to Cloudflare’s caching changes to your website might not show up right away. You can always go to the Cloudflare console and put it in developer mode (which will temporarily disable caching so changes show up right away).


  1. Somewhat oddly, Cloudflare’s blog itself is not served over a secure connection.

  2. Even if your site, like mine, is not particularly sensitive, the more connections are secured even for trivial sites the more difficult it becomes for bad actors to snoop or censor.

Btrees Are the New Black

Self-balancing trees are an important class of data structures. In most textbooks, red/black binary trees are introduced first, and then btrees come later, usually mentioned as a data structure used in file systems. Both data structures have asymptotically identical performance ($O(\log n)$ time for insertion, lookup and deletion). Here, I will show why, despite your textbook’s ordering, btrees may often be a better choice.

Btrees (or variants) are often used in file systems because they make better use of devices which read and write blocks of data (such as hard drives or flash). They do this by trading some extra space and computation for increased data locality. Each node in a btree stores between $t$ and $2t - 1$ keys (except for the root node, which might contain less entries). Search may require examining each key in the nodes on the search path and any change to a node may require shifting all of the entries to maintain the sort. However, you will notice that all of these opererations are data local.

What else is block-like these days? Modern (misleadingly named) random access memory. Modern CPUs have multiple layers of caching to improve the performance dealing with the (relatively) slow main memory. My Intel Core 2 Duo has a 64 byte cache line size, meaning that requesting the contents of a (correctly aligned) memory address makes examining the contents of memory address within 64 bytes much cheaper (a full cache miss costs ~165 cycles). Another advantage is that btrees result in less calls to the memory allocator since each node stores more entries.

The full description of red/black trees and btrees is outside the scope of this article. For our purposes we just need a brief description of what the two tree types look like.

A red-black tree is a balanced binary tree. Each node in the tree stores one key to value mapping and points to the left (nodes with a key that is smaller than the current node) and right (nodes that are larger than the current node). As you can see, looking up an item in the tree involves branching at each node to the left or right until termination.

A btree, on the other hand, allows more entries per node (as above, between $t$ and $2t-1$ where $t$ is a parameter of the btree). Each non-leaf node has up to $2t$ children, each child tree containing the nodes intermediate in value between the constituent entries. Searching in a btree requires examining, on average, half of the entries in a node to determine which child to descend into. On the other hand, the depth of the tree is decreased since, even in the worst case, each node has at least $t$ entries (excluding the root node).

For many, binary trees are the standard choice for a tree data structure. With modern architectures this might not be the most efficient choice. Here, we examine the speed of inserting and then retrieving 107 elements. The mapping we are considering is from randomly chosen integers to void pointers.

Unlike binary trees, btrees have a tunable parameter $t$, which specifies the number of elements stored in each node. So first we examine the best choice of the $t$ parameter.

The graph shows the user CPU time for different choices of $t$. As you can see, the optimal choice of $t$ is far-away from a binary tree. The minimum runtime was at $t = 24$ (give or take some noise, but that’s close enough for our purposes). This means nodes store between $24$ to $47$ keys.

We then compared a btree with $t = 24$ against two other commonly chosen data structures. A dynamically sized hash map and a red-black tree. In all cases, I used a straightforward C implmentation and compiled with standard optimizations for benchmarking.

Unsuprisingly, a hash map performs far and above the rest. This is to be expected, mapping is exactly what hash maps are for and, in most situations, they should perform insertions and lookups with amortized $O(1)$ time complexity. However, for situations where you wish to preserve order, or avoid any insertion ever taking more than $O(\log n)$, a tree may be a better choice. For those situations you can see that a well-tuned btree was outperforming a red/black tree by more than 2 times.

As memory architectures begin to behave more like block devices than random access memory, it is worth revisiting algorithm choices that were made when memory was uniform. As a bonus I found a btree easier to reason about and implement (the number of lines needed to implement both a btree and red/black was very similar though).

All of the code for the algorithms are available on github.

Updates

  • Fixed typos. Thanks to a helpful readers for pointing them out.

  • Clarified wording and the amortized analysis in the paragraph on hash trees.

  • Discussion on hacker news. I learned that Google provides a STL-like implementation of btrees for similar reasons to those discussed here.

  • Corrected the date of the blog post.

Atoms/Symbols in Python

I’ve been learning Erlang and one thing I like is the idea of atoms (aka symbols in Ruby I think) 1. Atoms in Erlang are written by just using lowercase. They are pointers to a global string table. This means comparison operations are very fast (you are just comparing two pointers) but for debugging and pattern matching the atom has a meaning (the string).

Atoms are particularly useful in pattern-matching languages like Erlang, but they’re a useful idea even without pattern matching. They’re a very lightweight way of communicating something with a clear semantic meaning (compare returning the atom ok with the integer 0) with very little runtime cost.

There are other solutions such as enums and constant definitions but these suffer from two problems. One, you have to go and declare them, so for a small function its tempting to just hardcode a value. Two, in most languages enums and constants map from a semantic meaning to an arbitrary value (for example, an enum in C is a int). When I return the atom ok I mean ok. Testing of ok == 0 or ok + 1 are meaningless operations. Atoms are restricted - the only thing you can do is test for equality. If my debugger shows the value 0 instead of the meaning, that’s not very helpful.

I was discussing atoms in python with a colleague when some googling revealed the Python almost has atoms. Python has a built-in function intern which will insert a string in a global table ensuring that future equality comparisons with other strings that are interned are a constant time operation. It automatically interns string literals. Thus, python string literals can be used at atoms. The debugger will, of course show the content of the string.

There are some disadvantages to using a string as an atom. For one thing, its not common practice in python so people might get confused. Additionally, strings support additional operations which don’t make sense for atoms (for example: 'ok' + 'error' will happily give you 'okerror'). One solution would be to make a thin wrapper which doesn’t expose these functions.

1
2
3
4
5
6
7
8
9
10
11
class atom(object):
    def __init__(self, a):
        self._a = intern(a)
    def __eq__(self, another_atom):
        return another_atom._a == self._a
    def __ne__(self, another_atom):
        return not self.__eq__(another_atom)
    def __repr__(self):
        return 'atom: ' + self._a
    def __hash__(self):
        return hash(self._a)

Python is going to get enums. Overall, I think the true enums provide many of the advantages of atoms (such as retaining their semantic meaning)2. The main difference is the need to declare them in advance. This is good for big projects it leaves one tempted to do something simpler for a small function.


  1. Erlang did not originate the idea of of atoms - its just where I’m most familiar with them.

  2. For a more negative take on enums in python see this post

You Look Ridiculous

Conservative New Zealand politician Colin Craig threatened satirical website The Civilian with defamation proceedings because they “quoted” him in satirical news article news link. Outrageously, Colin’s lawyer even demanded $500 cash as well as a correction. Luckily, the author of The Civilian responded with a tongue-in-cheek correction and Colin appears to have realized how silly the whole thing makes him look and dropped the whole thing.

Unfortunately, I’m not as a clever as some, so I will respond by writing a boring policy recommendation instead.

This situation really annoys me. I think defamation threats should be rare and made primarily against large publications or figures when they get something wrong that can really damage your reputation and the remedy should often just be a retraction or apology. In practice, they seem to be used as a legal intimidation by a well-funded party to shut someone up they don’t like and they suspect won’t have the means to defend the case (as in this case it often backfires). I’ve been on the receiving end of a threat (I wrote a truthful bad review about a crappy company) and just took down the page. At the time, I was a student and I had nothing like the kind of money I’d need to mount a defense if they followed through.

This situation is bad for society. It allows the strong to strong-arm the weak and impedes freedom of expression. Defamation statutes explicitly defend free expression (for instance, the honest opinion defense in New Zealand law), but in practice this is really only available to people willing to fight an expensive case. Mere mortals, concede or risk bankruptcy.

I think if we are going to maintain defamation as a tort we need to considering reigning in its use as intimidation by:

  • Sanctioning lawyers who write scary letters for nonsense cases. Lawyer’s have a duty to both their clients and society. Writing intimidating letters for cases which they know, or ought to know are rubbish should get them in hot water.

  • Quick and cheap remedies. Particularly for publications which don’t provide any revenue, there ought to be a quick remedy, such as an ombudsman, who can throw out a meritless case (and award costs) so that a poor blogger can speak truth to power with impunity without fearing for their livelihood.

Repealing defamation outright is not so unreasonable either. I don’t think anything really terrible would happen. I think it would be improvement on the status quo, if we can’t fix it. Australia has taken a step in the right direction by forbidding large corporations from claiming defamation.