Re: [jgit-dev] Distributed Git Server using JGit

Hi all,

Thanks for your quick replies. I didn't pick the right words when starting the discussion.
I don't want to build a "scalable" solution, just a "highly available" one. :)
So `N` is either 1 or 2.

@Matthias
> For option 1 I'd recommend you give Gerrit with its high-availability plugin a try
> and if you face issues collaborate with the Gerrit community to improve this solution
> instead of starting your own implementation which is for sure more work.

> For option 2 you may consider to join an initiative started by Luca Milanesio, one
> of the Gerrit maintainers, to implement an open source implementation of the
> JGit DFS API on Cassandra. The current PoC patch series is maintained here

I'll look into these options with my colleagues.

> I don't get why you need a scalable server if only a few of your thousands of
> repositories are actively used. There are many Gerrit installations serving
> thousands of repositories from a single server.

In our company, we provide customers with a web editor, which allows them to edit their
files in the browser. Behind the scenes, two types of Git repositories are used:

- normal Git repository: workspace
- bare Git repository: remote git server

Normal Git repositories serve as workspaces. Each time a user connects, we create
a workspace for them (cloned from the remote). It is bound to the HTTP session. Each
time the user saves something, we store the data in the workspace. When they decide to
push / pull, we communicate with the remote Git server. We also maintain a caching
system to speed up the clone. So briefly:

Git Server
 ^    |
 |    v
 |   Git Cache (web-server)
 |    |
 |    v
Workspace (web-server)

Using solution 1 with 2 GitServlets would allow us to remove the Git Cache layer while
still cloning repositories quickly. I think I went too deep into the internal
implementation. I can start another thread if needed.

@Luca
> How many repos, users and locations are you going to support?
Thousands of repos and thousands of users; locations are worldwide, but principally in the US and France.

@Martin
Thanks for your advice. I didn't know all this about NFS and Git.

Thank you all for your valuable advice. I'll study all these notes with my colleagues today.
Have a nice day!

Mincong



On Wed, May 16, 2018 at 11:56 PM, Martin Fick <mfick@xxxxxxxxxxxxxx> wrote:
On Wednesday, May 16, 2018 10:12:39 PM Mincong Huang wrote:
> I'm creating a Git server, and I'd like to use JGit as
> implementation. JGit contains a module called
> `org.eclipse.jgit.http.server` which allows to achieve
> this easily via GitServlet[1]. However, I need the Git
> server to be clustered,
> to provide a scalable solution. I've two possible
> solutions, but I want to have your opinions about them.
>
> Solution 1: N GitServlets + 1 NFS
> Use N Git servlets and share the same network filesystem.
> Each server points the same file system in the network.
> This solution is used by GitLab, see
> https://docs.gitlab.com/ee/administration/high_availability/nfs.html

I am not sure that doc is accurate.  I don't believe git or
jgit uses the type of "Advisory locking" they are referring
to, i.e. it does not lock files with fcntl(); instead it uses
lock files.  Lock files work on most NFS (v2,3,4)
implementations, even without lockd.
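
To illustrate the difference, here is a minimal sketch of the
lock-file pattern in plain Java.  It is not jgit's actual
implementation: the class name LockFileSketch and the method
updateWithLockFile are hypothetical, and it assumes that the
atomic check-and-create of Files.createFile is honored by the
underlying filesystem, which should be verified for any given
NFS mount.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical illustration of lock files; not taken from jgit. */
public class LockFileSketch {

    /**
     * Tries to replace the content of target by way of "target.lock".
     * Returns false if another writer already holds the lock file.
     */
    public static boolean updateWithLockFile(Path target, String newContent)
            throws IOException {
        Path lock = target.resolveSibling(target.getFileName() + ".lock");
        try {
            // Fails if the file already exists; the existence check and the
            // creation are a single atomic operation, so no fcntl() lock
            // is involved.
            Files.createFile(lock);
        } catch (FileAlreadyExistsException e) {
            return false; // someone else is updating; leave their lock alone
        }
        try {
            // Write the new content into the lock file first ...
            Files.write(lock, newContent.getBytes(StandardCharsets.UTF_8));
            // ... then rename it over the target. Replacing an existing
            // target with ATOMIC_MOVE is implementation specific; this
            // sketch assumes a POSIX-style rename.
            Files.move(lock, target, StandardCopyOption.ATOMIC_MOVE);
            return true;
        } finally {
            // No-op after a successful rename; cleans up our own lock
            // file if the write or the rename failed.
            Files.deleteIfExists(lock);
        }
    }
}
```

A caller that gets false back would typically retry later or
report that the file is locked.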

NFS does have some caching issues though, and I suspect that
the lookupcache=positive mount option mentioned in that doc
would help, but I had not heard of it until now.  There are
other NFS options to disable caching also that will work.
They do impact performance, so most big installations do not
use them.  I suspect lookupcache=positive would be
acceptable for most installations performance-wise.  There
currently are a few paths in jgit which can be improved in
order to get similar results to that mount option even
without it.  Some of these have been mentioned on the
repo/Gerrit mailing list recently, and we are currently
internally working on some (today even).


> Personally, I'm afraid of concurrent file
> access to Git repository, which leads
> to data corruption. According to this post[2], Git has
> mechanism to protect itself, e.g using index lock. But a
> Git bare repository does not have index, right? I'm
> confused.

Correct, bare repos have no index.  That reference also
mentions 1) git gc and 2) .keep files. 

#1 For git gc, there are some races, but this is true even
without NFS. 

I have a change up for review for jgit here to help reduce
one of these:  https://git.eclipse.org/r/c/122288/2  There
is a similar race for loose objects that also needs to be
fixed.  That being said, that race has been around forever,
and no one has bothered to fix it because it is very rare
(although I do believe I have evidence of it happening for
loose objects recently).

#2 Messing with .keeps should not cause corruption issues. 


Many people use Gerrit with NFS for very large
installations, so using a simplified jgit GitServlet should
work as well as Gerrit on NFS.

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation


