Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [jgit-dev] stable-4.1 fixes


On Apr 11, 2018, at 5:24 PM, Nasser Grainawi <nasser@xxxxxxxxxxxxxx> wrote:


On Apr 11, 2018, at 6:59 AM, Hugo Ares <hugo.ares@xxxxxxxxxxxx> wrote:

On Wed, 2018-04-11 at 09:14 +0200, Matthias Sohn wrote:
On Wed, Apr 11, 2018 at 2:08 AM, Hugo Ares <hugo.ares@xxxxxxxxxxxx> wrote:
On Wed, 2018-04-11 at 01:27 +0200, Matthias Sohn wrote:
On Wed, Apr 11, 2018 at 1:00 AM, Nasser Grainawi <nasser@xxxxxxxxxxxxxx> wrote:

On Apr 10, 2018, at 4:50 PM, Matthias Sohn <matthias.sohn@xxxxxxxxx> wrote:

On Wed, Apr 11, 2018 at 12:45 AM, Nasser Grainawi <nasser@xxxxxxxxxxxxxg> wrote:

On Apr 10, 2018, at 4:25 PM, Matthias Sohn <matthias.sohn@xxxxxxxxx> wrote:

On Tue, Apr 10, 2018 at 10:36 PM, Nasser Grainawi <nasser@xxxxxxxxxxxxxg> wrote:

On Mar 29, 2018, at 10:21 AM, Nasser Grainawi <nasser@xxxxxxxxxxxxxx> wrote:


On Mar 24, 2018, at 1:45 PM, Matthias Sohn <matthias.sohn@xxxxxxxxx> wrote:

On Sat, Mar 24, 2018 at 1:32 AM, David Pursehouse <david.pursehouse@gmail.com> wrote:


On Fri, 23 Mar 2018, 23:01 Matthias Sohn, <matthias.sohn@xxxxxxxxx> wrote:
On Fri, Mar 23, 2018 at 9:52 PM, Nasser Grainawi <nasser@xxxxxxxxxxxxxg> wrote:
Hey folks,

I'm doing some work to stabilize the Gerrit 2.12 release series that we're running in several places still internally. As part of that, I found a few JGit commits that I'd like to backport to the stable-4.1 branch.

You can see the fixes as the latest 12 commits on https://source.codeaurora.org/quic/gerrit4msm/jgit/log/?h=stable-4.1-backports



I cherry-picked the patches you want and pushed them for stable-4.1
 
Please review this series and let me know if you really want 4.1.3 or if you can use 4.5.4 like we do

I'll test out 4.5.4 today and let you know mid-next week. Thanks for telling me about that version and that you have used it in prod with 2.12, that really helps.


Sorry for the delay getting back to you. We tried 4.5.4 and it works pretty well. I found one NFS-related bug and uploaded a fix [1]. We're also seeing MissingObjectExceptions that we think are NFS related, but we're not certain. The setup we have for this instance is a master replicating over git:// to git-daemon through a load-balancer to two slaves sharing an NFS filesystem. The slaves are running a custom git.git (has a few NFS error-avoidance patches) for repacking.



did I get it right that the master also accesses git repositories via NFS ?

No, master has its own disk.


 So jgit running on master is victim to NFS issues on a slave when trying to replicate changes to the slave ?

No, sorry. We're only seeing these errors on the slaves. AFAIK, replication is working fine.


Ah, ok so this is UploadPack failure on slave when some client fetches from it and client
wants a commit not available on slave or slave doesn't see it yet since it arrived via replication
from master to the other slave. This sounds similar to issues we saw in concurrent push tests
to a multi-master setup running on NFS. We introduced new options supportsatomicfilecreation
and also setting trustfolderstat to false might help, though Hugo reported massive performance problems.

listing folder content of parent folder a given commit resides in instead of setting trustfolderstat to false could help.
I think Hugo said this works and is faster than setting trustfolderstat

On our side, all repositories are configured wiht trustfolderstat to false. We had that for a long time and this setting was only used
when listing the pack files. Without this setting, we used to have a lot of problems on slaves reading outdated stuff because of NFS.
When we deployed this change[1] which make this setting effective for packed-refs we had such big performance degradation that
it caused an outage.

We reverted that change internally and instead , based on Christian feedback, we simply list the content of the parent folder of the
packed-refs and this seems to be enough to not get an outdated version of the packed-refs from NFS.


can you please contribute this patch ?

Sure, that was the plan. I need to code this in JGit first though. What I did is revert [1] and patched Gerrit to list the content of the
parent folder of the packed-refs file right before Gerrit is about to update a reference. After the outage, I did not have time and our
"room for error/outage" bugdet was exhausted so I could not implement proper fix in JGit and test it.

Our servers are very stable these days so I will be able to test this patch on production before uploading it. I work on this as
soon as possible.

Thanks Hugo! We have trustfolderstat set to false and we're using 4.5.4 + my new fix and we're still seeing this exception. The team using these servers hasn't complained of a performance problem, but your improved fix does sound more optimal. I might get a change up for review this week (was going to be today, but got busy) that takes a stab at this.

As for the exception we're still seeing, it doesn't look like loose objects have any kind of protection from reading an outdated NFS cache (loose refs already do dir.list() a bunch, so I think they're good). I want to be able to add that protection without incurring a similar performance problem to packed-refs. For loose objects, currently the code checks the loose object cache, then continues on to look at packs if it can't find an object, and then finally falls back to looking for uncached loose objects. I was initially thinking we'd want some kind of outer wrapper when we've already checked both loose objects and packed and we still haven't found an object. Or we could choose to do similar to what we do with packs and do the directory listing (or it seems maybe dir open/close is enough) whenever we don't find the object during the final loose objects check.

Martin and I were also discussing if this is more of a 'refreshFolderStat' option, since we are trusting, just with a caveat. We could probably also re-work the packs check to do a directory listing or open/close for the objects dir so that we can trust the isModified result. And since we're really doing this pretty much any time we do isModified, we probably want to push it down into FileSnapshot. When I get a little more time I'll try to get some changes up for review for this stuff, but early idea feedback is welcome.



Ok, I have changes up for this now: https://git.eclipse.org/r/123296 and https://git.eclipse.org/r/123297

If anyone has time to review, that'd be great, thanks!




 




Why do you use slaves sharing the same NFS ?

Mostly because of less replication traffic and the slaves share the effort of repacking, but there's added benefits like knowing the slaves have a consistent view (one can't get out of date). With just 2 slaves it's not as big of a deal, but we have other instances with 8-10 slaves in a site.

 
This is what the exceptions look like (jgit code is the same as [1]):
[2018-04-06 10:42:38,524] [SSH git-upload-pack 'XXXXXXX' (YYYYYYY)] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user YYYYYY account ZZZZZZ) during git-upload-pack 'XXXXXXX'
org.eclipse.jgit.transport.UploadPackInternalServerErrorException
        at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:768)
        at org.eclipse.jgit.transport.UploadPack.upload(UploadPack.java:667)
        at com.google.gerrit.sshd.commands.Upload.runImpl(Upload.java:80)
        at com.google.gerrit.sshd.AbstractGitCommand.service(AbstractGitCommand.java:101)
        at com.google.gerrit.sshd.AbstractGitCommand.access$000(AbstractGitCommand.java:32)
        at com.google.gerrit.sshd.AbstractGitCommand$1.run(AbstractGitCommand.java:70)
        at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:437)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:376)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.eclipse.jgit.transport.WantNotValidException: want 56c76057710dd3ca29e66ef6cdc4b727be781384 not valid
        at org.eclipse.jgit.transport.UploadPack.parseWants(UploadPack.java:1197)
        at org.eclipse.jgit.transport.UploadPack.processHaveLines(UploadPack.java:1072)
        at org.eclipse.jgit.transport.UploadPack.negotiate(UploadPack.java:1052)
        at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:745)
        ... 14 more
Caused by: org.eclipse.jgit.errors.MissingObjectException: Missing unknown 56c76057710dd3ca29e66ef6cdc4b727be781384
        at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:158)
        at org.eclipse.jgit.lib.ObjectReader$1.open(ObjectReader.java:300)
        at org.eclipse.jgit.revwalk.RevWalk$2.next(RevWalk.java:971)
        at org.eclipse.jgit.transport.UploadPack.parseWants(UploadPack.java:1184)
        ... 17 more

We're assuming this error happens due to NFS caching directory entries. We're looking at testing that out by changing the mount options (lookupcache, noac, ac*time), but these are prod hosts, so we don't want to hurt the performance too much. Do you have any ideas on how we'd approach solving this in JGit or a different hypothesis on what could be causing this error?



AFAIR I never saw a WantNotValidException

@Chris you did a lot of testing on NFS when we implemented multi-master on shared NFS, maybe you have an idea ?

-Matthias

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, 
a Linux Foundation Collaborative Project


-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, 
a Linux Foundation Collaborative Project



_______________________________________________
jgit-dev mailing list
jgit-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/jgit-dev



-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, 
a Linux Foundation Collaborative Project

_______________________________________________
jgit-dev mailing list
jgit-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/jgit-dev

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, 
a Linux Foundation Collaborative Project


Back to the top