qW d dl Z d dlmZmZ d dlmZmZmZmZ d dl m
Z
n
# e$ r eZ
Y nw xY wddl
mZmZmZmZ ddlmZmZmZmZ ddlmZ dd lmZmZ dd
lmZmZmZm Z m!Z!m"Z" e j# d Z$ e j% Z&e&' e j( d d!de)de*de*de+dee dee de,de,defdZ- d!dede*de*de+dee dee de,de,defdZ. d!de
de*de*de+dee dee de,de,defdZ/ d"de
de*de*de+dee dee de,defd Z0dS )# N)basenamesplitext)BinaryIOListOptionalSet)PathLike )coherence_ratioencoding_languagesmb_encoding_languagesmerge_coherence_ratios)IANA_SUPPORTEDTOO_BIG_SEQUENCETOO_SMALL_SEQUENCETRACE)
mess_ratio)CharsetMatchCharsetMatches)any_specified_encoding iana_nameidentify_sig_or_bom
is_cp_similaris_multi_byte_encodingshould_strip_sig_or_bomcharset_normalizerz)%(asctime)s | %(levelname)s | %(message)s 皙?TF sequencessteps
chunk_size thresholdcp_isolationcp_exclusionpreemptive_behaviourexplainreturnc h t | t t f s/t d t | |rJt j }t t t
t t | } | dk rt
d |rEt t t
|pt j t# t% | dddg d g S |At t d d
| d |D }ng }|At t dd
| d
|D }ng }| ||z k r't t d||| d}| }|dk r| |z |k rt+ | |z }t | t, k }
t | t. k }|
r4t t d | n5|r3t t d | g }|rt1 | nd}
|
6| |
t t d|
t5 }g }g }d}d}d}t# }t7 | \ }}|D| | t t dt | | | d d|vr| d |t8 z D ]}|r||vr
|r||v r||v r| | d}||k }|ot= | }|dv r$|s"t t d| l t? | }n8# t@ tB f$ r$ t t d| Y w xY w |rS|du rOtE |du r| dt+ d n#| t | t+ d | n,tE |du r| n| t | d | }nx# tF tH f$ rd}t |tH s/t t d|tE | | | Y d}~d}~ww xY wd}|D ]}tK || rd} n|r$t t d|| tM |sdnt | | t+ | |z }|o|duot | | k } | r!t t d| t+ t | dz }!tO |!d }!d}"d}#g }$g }%|D ]m}&|&|z | d z k r| |&|&|z }'|r |du r||'z }' |'( ||rd!nd"# }(nK# tF $ r>}t t d$|tE | |!}"d}#Y d}~ nd}~ww xY w|r|&dk r~| |& d%k rrtS |d& })|r`|(d|) |vrTtM |&|&dz
d' D ]?}*| |*|&|z }'|r |du r||'z }'|'( |d!# }(|(d|) |v r n@|$ |( |% tU |(| |%d' |k r|"dz
}"|"|!k s|r|du r no|#s|r|s | t+ d( d ( |d"# n\# tF $ rO}t t d)|tE | | | Y d}~d}~ww xY w|%rtW |% t |% z nd}+|+|k s|"|!k r}| | t t d*||"tY |+d+z d,- |dd|
fv r*|#s(t% | ||dg | },||
k r|,}n|dk r|,}n|,}Ct t d.|tY |+d+z d,- |st[ | }-nt] | }-|-rAt t d/ |tE |- g }.|dk rB|$D ]?}(t_ |(d0|-rd1 |- nd }/|. |/ @ta |. }0|0r4t t d2 |0| | t% | ||+||0| ||
ddfv rt|+d0k rnt
d3| |r9t t t
| t# || g c S ||k rnt
d4| |r9t t t
| t# || g c S t | dk r|s|s|r t t d5 |r6t
d6|j1 | | n{|r||r|r|j2 |j2 k s|0t
d7 | | n1|r/t
d8 | | |rDt
d9|3 j1 t | dz
nt
d: |r9t t t
| |S );ae
Given a raw bytes sequence, return the best possible charsets usable to render str objects.
If there are no results, it is a strong indicator that the source is binary/not text.
By default, the process will extract 5 blocks of 512 bytes each to assess the mess and coherence of a given sequence,
and will give up on a particular code page after 20% of measured mess. Those criteria are customizable at will.
The preemptive behavior DOES NOT replace the traditional detection workflow; it prioritizes a particular code page
but never takes it for granted. It can improve performance.
You may want to focus your attention on some code pages and/or exclude others; use cp_isolation and cp_exclusion for
that purpose.
This function will strip the SIG from the payload/sequence every time, except for UTF-16 and UTF-32.
By default the library does not set up any handler other than the NullHandler. If you choose to set the 'explain'
toggle to True, it will alter the logger configuration to add a StreamHandler that is suitable for debugging.
A custom logging format and handler can be set manually.
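The sampling and give-up rules described above can be sketched with the standard library alone. This is a minimal illustration, not the library's code: the helper names `sample_chunks` and `exceeds_threshold` are hypothetical, and the fallback to a single chunk for short payloads and the use of a plain mean are assumptions made for the example. Only the defaults (5 steps, 512 bytes, 0.2 threshold) come from the docstring.

```python
from typing import List


def sample_chunks(payload: bytes, steps: int = 5, chunk_size: int = 512) -> List[bytes]:
    """Return up to `steps` evenly spaced chunks of `chunk_size` bytes.

    Hypothetical helper: it mirrors the documented defaults only.
    """
    length = len(payload)
    if length == 0:
        return []
    # Assumption: a payload too short to hold every chunk is inspected whole.
    if length < steps * chunk_size:
        return [payload]
    stride = length // steps
    offsets = range(0, length, stride)
    # Integer division can yield one extra offset, so cap the result at `steps`.
    return [payload[o:o + chunk_size] for o in offsets][:steps]


def exceeds_threshold(mess_ratios: List[float], threshold: float = 0.2) -> bool:
    """Tell whether the mean measured mess reaches the give-up threshold.

    Assumption: a simple mean over all chunks stands in for the real criteria.
    """
    if not mess_ratios:
        return False
    return sum(mess_ratios) / len(mess_ratios) >= threshold
```

With the defaults, a 10,000-byte payload yields five 512-byte chunks taken at offsets 0, 2000, 4000, 6000 and 8000, and a candidate code page whose chunks average 20% mess or more would be dropped.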
z4Expected object of type bytes or bytearray, got: {0}r z