Mirror of the Rel4tion website/wiki source, view at <http://rel4tion.org>

[[ 🗃 ^yEzqv rel4tion-wiki ]] :: [📥 Inbox] [📤 Outbox] [🐤 Followers] [🤝 Collaborators] [🛠 Commits]

Clone

HTTPS: git clone https://vervis.peers.community/repos/yEzqv

SSH: git clone USERNAME@vervis.peers.community:yEzqv

Branches

master

20.mdwn

[[!template id=ticket class=task assigned=fr33domlover done=yes]]

[[!meta title=“Display commit history for git repos”]]

Issue

The Git repo history log is probably one of the most useful UI elements in a repo’s web view. Add a page to the web app which displays a summary of commits, like a graphical git log.

Plan

If each commit had a single parent, commit traversal would probably be simple. Find the ref for a given branch, say master, and move from each commit to its parent recursively, getting the desired reverse-chronological order. However, the situation isn’t so simple. A commit may have more than one parent. I’m not a git expert, but I know at least one way for that to happen: git merge.

The git history is therefore not a list, but a directed acyclic graph (DAG). Here are some suggestions for traversing the DAG.

Create a hashmap of refs. Go over the DAG in whatever way and push all the refs into the hashmap. Then read the commit data for each ref, sort the commits by reverse chronological order and traverse the list.
Load the refs into a graph using some graph library. There are several. For example, containers has Data.Graph and there’s also dawg and related packages. Since in my case the number of links is small, it may be more efficient to use an edge list and not an association matrix. Anyway, once the graph is built, run some sort of traversal loop on it. Regular DFS and BFS won’t work, but topological order should be fine. However, Git rev-list uses time based order by default, which isn’t the same as topological. Find out what I need.

Go to the Gogs code and see how it builds the commit history page. Result: It seems to be using regular git log. But it uses the command’s “skip” and “max count” parameters for pagination.

There seems to be no existing pure-Haskell solution for that, otherwise I could just use it. The gitlib package’s hit backend is incomplete, and commit traversal happens to be one of the things not implemented yet. Other (C based) backends do have it, but I’d like to try doing this in Haskell directly.

IDEA: Time based order may mix commits from different branches. Instead, when a commit has more than one parent, pick the first parent to check using the commit time. In other words, a specific topological sort.

Progress

Since I don’t remember exactly how topological sort works, I’m using Wikipedia. There are 2 algorithms there. I’ll start with the first one. Here’s an initial version without any particular unique order:

L <- Empty list that will contain the sorted elements
S <- Set of all nodes with no incoming edges
while S is non-empty do
  remove a node n from S
  add n to tail of L
  for each node m with an edge e from n to m do
    remove edge e from the graph
    if m has no other incoming edges then
      insert m into S
if graph has edges then
  return error (graph has at least one cycle)
else
  return L (a topologically sorted order)

Let’s adapt the wording to use some git terms:

-- A list that will contain the sorted commits, latest commit first
L <- Empty list
-- Set of all commits with no parents left in the graph
S <- Set containing a single commit, the last one in the chosen git branch
while S is non-empty do
  remove a commit n from S
  add n to tail of L
  for each commit m with an edge e from n to m, i.e. m is a parent of n, do
    remove edge e from the graph
    if m has no other incoming edges, i.e. no other children, then
      insert m into S
if graph has edges then
  return error (graph has at least one cycle)
else
  return L (a topologically sorted order)

Here’s an example commit graph from man git-rev-list:

  .-A---M---N---O---P---Q
 /     /   /   /   /   /
I     B   C   D   E   Y
 \   /   /   /   /   /
  `-------------'   X

I is the initial commit. The A-Q line is the first parent of each merge.

The characteristics we want to achieve (goals):

(G1) Given a sequence of n >= 2 commits C1 ... Cn, where C1 and Cn have at most 2 edges each and the rest have exactly 2, the sequence must appear as-is in the sorted order. For example, take =X-C1-C2-Y=. X and Y 3 edges each, or more. C1-C2 is a pair of commits made to a side branch, and then merged into Y. Then the sequence C1-C2 should appear in the sorted order without anything in between.
(G2) When a branch splits into several lines of parallel development (for example the branch itself and several side branches), and the side lines are then merged into the main line at a single point, the order in which they should appear (each appears as-is due to the previous bullet point) is the reverse chronological order of the last commit of each line.

The parts of the algorithm we need to choose (variables):

(V1) How does S work? Is it a set, a stack, a queue? When we remove a commit from S, how do we choose which commit to pick?
(V2) When we iterate over the parents of a commit, in which order do we do so? Does it matter?

Let’s try to satisfy G2. When need to pick a commit from S, suppose we always take the latest commit, according to the commit’s datetime field. If we do that, we’ll at least start later lines before ealier ones. And if we assume we already satisfied G1, it’s enough. But this simple solution works only under a certain condition. It will fail if we have the following case.

(master) ?--?--?--P1--C
                \    /
(fix-bug)        ?--P2

Suppose P1 was commited earlier than P2 chronologically. If we have both P1 and P2 in S, then using the suggested solution, we will pick P2 first which is correct. If we have only P2 in S while P1 is still waiting to be visited, same result. If we have neither in S yet, fine too. But what if we have P1 in S while P2 hasn’t reached S yet? That would ruin the solution. Is this condition possible?

Let’s follow the algorithm to find out. At some point, we take C out of S. Then we visit P1 and P2 in some order. Since C is the only child of both, both of them will reach S. Therefore we are safe. But what if one of them has another child:

(master) ?--?--?--P1--C
                \    /
(fix-bug)        ?--P2
                     \
(add-feature)         D

This will be a problem only if both C and D are ancestors of our branch’s last commit. But I have an idea: Since using the time of the last commit of a line is just a heuristic, what if instead of using the time, we use the order of the parents? Let’s try that for simplicity. Use the reverse order, i.e. first the merged branches, and then, last, the master branch.

Hmmm that last simplification doesn’t seem to help. New example:

(master) ?--?--?--P1--C---A
                \    /   /
(fix-bug)        ?--P2--B

In this example, iterating by reverse parent order seems to solve the problem. I can try to find an example where it doesn’t, but since I don’t (yet) exactly understand git commit graphs, I prefer to decide for now that we use the commit time. The problem arises if we pick C out of S before we pick B. Let’s assign times to arrange for that. The changes will be:

P1 becomes P and P2 becomes Q
X(n) means a commit labeled X with time n, where bigger n means later time (suppose n means days since the project started)

(master) ?–?–?—P(1)—C(4)—A(5) / / (fix-bug) ?——Q(2)—B(3)

Valid results:

A, C, B, Q, P

The algorithm will do the following:

1. L <- []                -- L = []        ; S = ?
2. S <- {A}               -- L = []        ; S = {A}
3. Remove A from S to L   -- L = [A]       ; S = {}
4. Insert B and C to S    -- L = [A]       ; S = {B, C}
5. Remove C from S to L   -- L = [A, C]    ; S = {B}
6. Insert P to S          -- L = [A, C]    ; S = {B, P}
7. Remove B from S to L   -- L = [A, C, B] ; S = {P}
8. Insert Q into S        -- L = [A, C, B] ; S = {P, Q}
9. Remove ...

This is going to be fine.

Let’s go back to the parent order.

B should come before C because it’s the merged parent
Q should come before P similarly

Maybe G2 isn’t even needed. Maybe it’s a too tight condition. Let’s try G1 and see what happens.

(master) A--B--C--D--E--F--G--H
                \            /
(fix-bug)        W--X---Y---Z

The order I want is: H, Z, Y, X, W, G, F, E, D, C, B, A

Let’s run the algorithm.

1. L <- []                -- L = []          S = ?
2. S <- {H}               -- L = []          S = {H}
3. Move H from S to L     -- L = [H]         S = {}
4. Insert G and Z to S    -- L = [H]         S = {G, Z}
5. Hmmmm

How do we choose between G and Z now? The order should be: Pick Z first and continue picking its parent until W. Idea: Use a stack for that. Rules:

When going over parents, go in parent order (i.e. merged last)
S is a stack, i.e. last in first out

Trying again.

1.  L <- [                 -- L = [          S =   ?
2.  S <- H]                -- L = [          S =  H]
3.  Pop H from S to L      -- L = [H         S =   ]
4.  Push G then Z to S     -- L = [H         S = ZG]
5.  Pop Z from S to L      -- L = [HZ        S =  G]
6.  Push Y to S            -- L = [HZ        S = YG]
7.  Pop Y from S to L      -- L = [HZY       S =  G]
8.  Push X to S            -- L = [HZY       S = XG]
9.  Pop X from S to L      -- L = [HZYX      S =  G]
10. Push W to S            -- L = [HZYX      S = WG]
11. Pop W from S to L      -- L = [HZYXW     S =  G]
12. Pop G from S to L      -- L = [HZYXWG    S =   ]
13. ...

Good, it’s going to work. Since it fully decides how to pick from S and how to traverse parents, maybe it’s enough. Now I have 2 options:

Implement the algorithm by myself
Use a graph library that can do this kind of topsort

Before I choose some existing library, I’d like to try naively building a solution myself. For simplicity, assume no pagination. Just make a single list. Ideas:

L can be a list to which we prepend, and reverse when done, or it can be a different sequence type with efficient append, such as DList or Seq or vector
S is a stack, can be easily implemented using a list
Build a HashMap in which they key is Ref and the value is a pair of Commit and Int, the latter being the number of the commit’s children

How to build the hashmap:

Make a singleton hashmap containing the last branch commit with 0 parents
Read each parent of that commit, insert into hashmap with 1 children
For each such parent, iterate over its children. For each child, check if it’s already in the hashmap. If yes, increase its child count. If not, insert with count 1 and recursively do steps 2 and 3 on it.

Hmmm I seriously doubt it will be faster than using a graph library. But there is one thing I could optimize here: There isn’t really need to remove edges from the graph, especially if checking for existence of incoming edges is a linear-time operation. Anyway, let’s just try the naive approach.

There are 3 graph libraries which seem reasonable:

[[!hackage containers]]
[[!hackage fgl]]
[[!hackage graph-core]]

I’m not sure which one is the best in my case. I’d like to start by trying FGL.

Result

Done for now. Works for a single branch, haven’t tested yet on repos which have multiple branches and merges between them.

[See repo JSON]