Support KB

GRW Collaboration files not visible on all filers

There are instances where files are not replicated to all peer filers due to an incorrect whiteout marker. We are seeing more reports of this on version 8.2.4.1.

BACKGROUND:


In order to provide accelerated notification of filesystem changes, Cloudfs Nodes provide functionality called Distributed Change Notify (DCN). Cloudfs Nodes
that are using collaboration register for Distributed Change Notifications from other Cloudfs Nodes, which allows activity such as deletes and creates to
appear remotely before a Snapshot sync occurs.

This behavior only impacts file types listed in the Global Collaboration setting.

There is an issue with these notifications that causes events, such as deletes, to be marked with an incorrect expiration marker, so the events remain visible beyond the time they are supposed to be effective.

The issue is due to a memory sanitization bug in the function that triggers these notifications during unlink, which causes the value marking the expiration to be unpredictable. The bug existed prior to 8.1 but had only been seen in a limited manner in previous releases.

Description of Recommended Fix:
Sanitize the memory in pixel8/replock.samba/replock-lease.c:__rll_change_notify(), then ensure the caller pixel8/replock.samba/replock-lease.c:rll_do_op() provides an appropriate asn depending on ownership. This is expected to be fixed in 8.2.5.0.18290.

 

We have two options to detect the affected files:

 

OPTION 1: 

A new tool is available at:

/mnt/cc1-ca/SUPPORT/toolbox/non-prod/opt-pix-bin/grw.py.2023-10-10-experimental

It will list its version as: version: 2023/10/10 experimental

Do not use it for anything except this scan; it is modified from the original. Do not run 'install' with it, as that will install the tool to the default location, which is not preferred for a custom tool.
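
The run example below invokes the tool from /mnt/etc rather than from the support share, so presumably it is first copied onto each node. A minimal sketch, assuming the paths shown in this article (the /mnt/etc destination is taken from the run example, not confirmed elsewhere):

cp /mnt/cc1-ca/SUPPORT/toolbox/non-prod/opt-pix-bin/grw.py.2023-10-10-experimental /mnt/etc/
chmod +x /mnt/etc/grw.py.2023-10-10-experimental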

It will detect Upper paths that match the whiteout pattern. It needs to be run on each node individually, because that is the node where any fixes would be applied; even a tool that could scan from anywhere would still require remoting in to each node to run the fixes.

Run it as follows:


/mnt/etc/grw.py.2023-10-10-experimental find -t f '<lessor-path>' > /cloudfs/.support/replication-issue.out 2> /cloudfs/.support/replication-issue.err & 

It will print the list of Upper paths.
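
Because the scan runs in the background, the output and error files can be inspected with standard shell tools once it completes; for example:

wc -l /cloudfs/.support/replication-issue.out    # how many Upper paths were found
head /cloudfs/.support/replication-issue.out     # spot-check the first few entries
cat /cloudfs/.support/replication-issue.err      # check for scan errors before trusting the list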

Note: The script also picks up temporary files such as dwl and dwl2 files, which are lock files; ignore them and do not include them in the fix (see the filter sketch below).
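
For example, a minimal way to drop those lock files from the list before the fix step, assuming they are identifiable by their .dwl/.dwl2 extensions (the .filtered filename is hypothetical):

grep -vEi '\.dwl2?$' /cloudfs/.support/replication-issue.out > /cloudfs/.support/replication-issue.filtered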

OPTION 2: 

Detect the affected files by running "grw.py gfi" and "grw.py fi" for the suspected file. If it is not replicated, follow the steps below:

If "grw.py fi" shows a whiteout in the .upper path on the FILER that is missing the file, it is most likely affected, as in the sample below (a quick single-file check is sketched after it).

upper(10816402 100777 sz=0 .own=kpfhk-pz01 .obj=0 .sn=a4294967295/t0/c0whiteout) snp=3177262

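A quick way to check a single suspected file for this marker is to filter the "grw.py fi" output for the whiteout string (this assumes the output format matches the sample above):

grw.py fi '<suspected-path>' | grep -i whiteout
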
  • Generate the list of files:

grwfind <path> -maxdepth 1 -type f > /cloudfs/.support/replication-issue.out 2> /cloudfs/.support/replication-issue.err &

Note: Pick only the files that are affected. In most cases all files under a particular directory are affected, but please validate this (a spot-check sketch follows the example below).

Ex: grwfind "/cloudfs/kpfny-pz01/projects/2619_ChangiT5/Drawings/BIM/BIM Exchanges/BIM_to_CAD/DWG/Revit_CAD Export" -maxdepth 1 -type f > /cloudfs/.support/replication-issue.out 2> /cloudfs/.support/replication-issue.err &
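
One way to validate the list is to spot-check each candidate for the whiteout marker before running the fix (this assumes "grw.py fi" accepts a path argument, as in the single-file check above; the .confirmed filename is hypothetical):

while IFS= read -r f; do
    grw.py fi "$f" | grep -qi whiteout && echo "$f"
done < /cloudfs/.support/replication-issue.out > /cloudfs/.support/replication-issue.confirmed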

 

FIX/WORKAROUND:

  • Run the fix from the Lessor filer ONLY.

cat /cloudfs/.support/replication-issue.out | grw.py force-claim -pr - > /cloudfs/.support/replication-issue.fixed 2> /cloudfs/.support/replication-issue.err &

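Once the background job finishes, the result and error files can be reviewed with standard shell tools; for example:

wc -l /cloudfs/.support/replication-issue.fixed   # entries the fix processed
cat /cloudfs/.support/replication-issue.err       # any errors reported by force-claim
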
  • Validate with grw.py gfi to ensure the files are replicated:

cat /cloudfs/.support/replication-issue.out | grw.py gfi - > /cloudfs/.support/replication-issue-gfi.out 2> /cloudfs/.support/replication-issue-gfi.err &

Then compare the file counts across filers:

rc 'grwls -l <path> | wc -l'

The number of lines should be the same on all filers.

More details are in Jira CFS-8801.

The fix is in patch CloudController_patch_8.2.4.2.18293, which can be applied on version 8.2.4.1. Later versions include the fix.