s(mrxddlZddlZddlZddlmZddlmZmZmZm Z ddl m Z ddl m Z mZGddZdS)N)Path)DictListOptionalAny)drop_privileges)DOCROOT_EXCLUDE_DIRSDOCROOT_MAX_DEPTHceZdZdZdejfdZddededede d e e ee ff d Z ed fded e de d eefd Zdeded efdZdeded efdZdS)DocrootProcessorzO Processes individual docroot to collect .htaccess files and metadata. loggerc||_dS)N)r )selfr s ,py/cl_website_collector/docroot_processor.py__init__zDocrootProcessor.__init__s  docrootdomainsusernametimeoutreturnc <tj}|||ggdddd} |jd|t|5||t |dz }|jdt |||D]}|jd||s|jd |n|D]}tj|z |kr#d |d <|jd |nb |jd |t|} | } | r#t| dn|} | r2|d | ||| dt| rVtj| tjr7| ||} |d | || | dn|jd|n#t$$r'} |jd|| Yd} ~ d} ~ wwxYwdddn #1swxYwYt |d|d<tj|z |d<|jd|d||dn3#t$$r&} |jd|| Yd} ~ nd} ~ wwxYw|S)ab Collect .htaccess file paths from a docroot without reading file contents. Args: docroot: Document root path domains: Domain names username: Owner username timeout: Processing timeout in seconds Returns: Dictionary with collected file paths or None if failed Fr)rrrhtaccess_file_pathssymlinkstimeout_reachedprocessing_time_secondshtaccess_files_foundzFinding .htaccess files in %s) max_depthrzFound %d .htaccess files in %sz - %szNo .htaccess files found in %sTrz@[WEBSITE-COLLECTOR] Timeout reached while collecting paths in %szCollecting .htaccess path: %sstrictr)linktargetr)location file_path real_path is_symlinkzCannot read file: %sz0[WEBSITE-COLLECTOR] Error collecting path %s: %sNrrz2Collected %d .htaccess file paths from %s in %.2fsz3[WEBSITE-COLLECTOR] Error processing docroot %s: %s)timer debugr_find_htaccess_filesr lenerrorrr(strresolveappend_normalize_pathexistsosaccessR_OK Exception)rrrrr start_timeresulthtaccess_filesr&pr(r'r%es rcollect_htaccess_pathsz'DocrootProcessor.collect_htaccess_pathssY[[  #%$'($%   6 a K  =w G G G *** p* p!%!:!:7N_ipstit!:!u!u !!"BCDWDWY`aaa!/;;IK%%h ::::%#pK%%&FPPPP%3 p p 9;;3g==8z9DocrootProcessor._find_htaccess_files..s0ZZZ$2P2PQUWX2Y2YZ1ZZZrz .htaccessz([WEBSITE-COLLECTOR] Error walking %s: %s)r)r3walkr 
r-pathrelpathcountsepris_filer4r.r5r0r6) rrr rr7r9dirsfilesdepthr&r;rDs ` @rr+z%DocrootProcessor._find_htaccess_filesnsY[[  V%'WW%5%5 > >!dE9;;+g55K%%&RT[\\\E7??EEGOOD'::@@HHEI%% DG[ZZZZdZZZQQQ%'' $T [ 8I"))++>Ic)nnbg>>>&--c)nn=== V V V K  H'ST U U U U U U U U VsD3E E=E88E= parent_pathdirnamecL t||z }|d}tD]`}t|}t|t jt|zs|j|jkrdSan#t$rYdSwxYwdS)a3 Check if directory should be excluded based on DOCROOT_EXCLUDE_DIRS. Supports both plain directory names (e.g. "node_modules") and nested paths (e.g. "wp-content/cache"). The check is performed against the full candidate path composed from parent_path and dirname. Fr!T) rr/r r.endswithr3rJnamer6)rrOrP candidatecandidate_normalized exclude_dirpatterns rrAz*DocrootProcessor._should_exclude_directorys [))G3I#,#4#4E#4#B#B 3  {++,--66rvG 7LMM !',66447     55 usB BB B! B!r&c tt|t|S#t$rt|jcYSwxYw)z: Normalize file path relative to docroot. )r.r relative_to ValueErrorrS)rr&rs rr1z DocrootProcessor._normalize_paths_ (tI224==AABB B ( ( ( ??' ' ' ' (s;>AAN)r)__name__ __module__ __qualname____doc__loggingLoggerrr.listintrrrr<r rr+boolrAr1r@rrr r s/w~RRcRDRCRZ]Rgo cNhRRRRhCTdf%%C%C%^a%kopskt%%%%NS34,((s(s((((((rr )r_r3r)pathlibrtypingrrrrclcommon.clpwdrcl_website_collector.constantsr r r r@rrrhs ,,,,,,,,,,,,******RRRRRRRRa(a(a(a(a(a(a(a(a(a(r
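The readable docstrings above describe two helpers: one checks a directory against `DOCROOT_EXCLUDE_DIRS`, supporting both plain names (e.g. "node_modules") and nested paths (e.g. "wp-content/cache"), and one normalizes a file path relative to the docroot. The following is a minimal sketch of that behavior, not the module's actual code. The `EXCLUDE_DIRS` values, the free-function form and the exact matching rules are assumptions made for illustration; the real constant and control flow are not recoverable from this dump.

```python
import os
from pathlib import Path

# Hypothetical values: the real DOCROOT_EXCLUDE_DIRS contents are not visible in the source.
EXCLUDE_DIRS = ["node_modules", "wp-content/cache"]


def should_exclude_directory(parent_path, dirname, exclude_dirs=EXCLUDE_DIRS):
    """Return True if parent_path/dirname matches an excluded name or nested path."""
    candidate = Path(parent_path) / dirname
    candidate_str = str(candidate)
    for exclude_dir in exclude_dirs:
        pattern = Path(exclude_dir)
        if len(pattern.parts) == 1:
            # Plain directory name: compare against the last path component only.
            if candidate.name == pattern.name:
                return True
        elif candidate_str.endswith(os.sep + str(pattern)):
            # Nested pattern: the full candidate path must end with the pattern.
            return True
    return False


def normalize_path(file_path, docroot):
    """Return file_path relative to docroot, or just the file name if it lies outside."""
    try:
        return str(Path(file_path).relative_to(Path(docroot)))
    except ValueError:
        # relative_to raises ValueError when file_path is not under docroot.
        return Path(file_path).name
```

Under these assumptions, a `cache` directory is skipped only when it sits directly under `wp-content`, while `node_modules` is skipped at any depth.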