3
Re0 @ s d Z ddlZddlZddlZddlmZ ddlmZmZm Z ddl
mZ ddlm
Z
ddlmZ dd lmZ G d
d deZdS )a
Module containing the ``UniversalDetector`` class, which is the primary
class a user of ``chardet`` should use.

:author: Mark Pilgrim (initial port to Python)
:author: Shy Shalom (original C code)
:author: Dan Blanchard (major refactoring for 3.0)
:author: Ian Cordasco
N )CharSetGroupProber)
InputStateLanguageFilterProbingState)EscCharSetProber)Latin1Prober)MBCSGroupProber)SBCSGroupProberc @ sn e Zd ZdZdZejdZejdZejdZ dddd d
ddd
dZ
ejfddZ
dd Zdd Zdd ZdS )UniversalDetectoraq
The ``UniversalDetector`` class underlies the ``chardet.detect`` function
and coordinates all of the different charset probers.

To get a ``dict`` containing an encoding and its confidence, you can simply
run:

.. code::

    u = UniversalDetector()
    u.feed(some_bytes)
    u.close()
    detected = u.result
g?s [-]s (|~{)s [-]zWindows-1252zWindows-1250zWindows-1251zWindows-1256zWindows-1253zWindows-1255zWindows-1254zWindows-1257)z
iso-8859-1z
iso-8859-2z
iso-8859-5z
iso-8859-6z
iso-8859-7z
iso-8859-8z
iso-8859-9ziso-8859-13c C sN d | _ g | _d | _d | _d | _d | _d | _|| _tj t
| _d | _| j
d S )N)_esc_charset_prober_charset_probersresultdone _got_data_input_state
_last_charlang_filterlogging getLogger__name__logger_has_win_bytesreset)selfr r /builddir/build/BUILDROOT/alt-python36-pip-20.2.4-5.el9.x86_64/opt/alt/python36/lib/python3.6/site-packages/pip/_vendor/chardet/universaldetector.py__init__Q s zUniversalDetector.__init__c C sZ dddd| _ d| _d| _d| _tj| _d| _| jr>| jj x| j
D ]}|j qFW dS )z
Reset the UniversalDetector and all of its probers back to their
initial states. This is called by ``__init__``, so you only need to
call this directly in between analyses of different documents.
Ng )encoding
confidencelanguageF )r r r r r
PURE_ASCIIr r r r r
)r proberr r r r ^ s
zUniversalDetector.resetc C s> | j r
dS t|sdS t|ts(t|}| js|jtjrJdddd| _nv|jtj tj
frldddd| _nT|jdrdddd| _n:|jd rd
ddd| _n |jtjtjfrdddd| _d| _| jd
dk rd| _ dS | j
tjkr.| jj|rtj| _
n*| j
tjkr.| jj| j| r.tj| _
|dd | _| j
tjkr| js^t| j| _| jj|tjkr:| jj| jj | jjd| _d| _ n| j
tjkr:| jst | jg| _| jt!j"@ r| jj#t$ | jj#t% x@| jD ]6}|j|tjkr|j|j |jd| _d| _ P qW | j&j|r:d| _'dS )a
Takes a chunk of a document and feeds it through all of the relevant
charset probers.

After calling ``feed``, you can check the value of the ``done``
attribute to see if you need to continue feeding the
``UniversalDetector`` more data, or if it has made a prediction
(in the ``result`` attribute).

.. note::

    You should always call ``close`` when you're done feeding in your
    document if ``done`` is not already ``True``.
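
The names visible in the ``feed`` bytecode (``codecs.BOM_UTF8``, ``BOM_UTF32_LE``, ``BOM_UTF32_BE``, ``BOM_LE``, ``BOM_BE``, and the result strings such as ``UTF-8-SIG``) suggest that the first chunk is checked for a byte-order mark before any prober runs. A minimal standard-library sketch of that check follows. ``sniff_bom`` is a hypothetical helper, not part of ``chardet``, and the two literal byte sequences for the unusual UCS-4 byte orders are assumptions, since they are not legible in this dump.

```python
import codecs


def sniff_bom(data):
    """Return an encoding name for a leading byte-order mark, or None.

    Hypothetical helper for illustration; not part of chardet.
    """
    data = bytes(data)  # accept bytearray as well as bytes
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8-SIG"
    # UTF-32 is tested before UTF-16 because BOM_UTF32_LE (FF FE 00 00)
    # begins with the UTF-16 little-endian BOM (FF FE).
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "UTF-32"
    # Assumed byte sequences for the unusual UCS-4 byte orders.
    if data.startswith(b"\xfe\xff\x00\x00"):
        return "X-ISO-10646-UCS-4-3412"
    if data.startswith(b"\x00\x00\xff\xfe"):
        return "X-ISO-10646-UCS-4-2143"
    if data.startswith((codecs.BOM_LE, codecs.BOM_BE)):
        return "UTF-16"
    return None
```

The order of the checks matters: testing the UTF-16 marks first would misreport UTF-32 little-endian input as UTF-16.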
Nz UTF-8-SIGg ? )r r r zUTF-32s zX-ISO-10646-UCS-4-3412s zX-ISO-10646-UCS-4-2143zUTF-16Tr r )(r len
isinstance bytearrayr
startswithcodecsBOM_UTF8r BOM_UTF32_LEBOM_UTF32_BEBOM_LEBOM_BEr r r" HIGH_BYTE_DETECTORsearch HIGH_BYTEESC_DETECTORr ESC_ASCIIr r r feedr FOUND_ITcharset_nameget_confidencer r
r r NON_CJKappendr
r WIN_BYTE_DETECTORr )r byte_strr# r r r r5 o s|
zUniversalDetector.feedc C s | j r| jS d| _ | js&| jjd n| jtjkrBdddd| _n| jtjkrd}d}d}x,| j D ]"}|slqb|j
}||krb|}|}qbW |r|| jkr|j}|jj
}|j
}|jd r| jr| jj||}|||jd| _| jj tjkrz| jd
dkrz| jjd xn| j D ]d}|s qt|trZxF|jD ] }| jjd|j|j|j
q4W n| jjd|j|j|j
qW | jS )
z
Stop analyzing the current document and come up with a final
prediction.

:returns: The ``result`` attribute, a ``dict`` with the keys
          ``encoding``, ``confidence``, and ``language``.
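
The constants and names that surround ``close`` in this dump (``ISO_WIN_MAP``, ``_has_win_bytes``, the ``iso-8859`` prefix test) suggest that an ISO-8859 guess is replaced by its Windows code-page counterpart when the input contained Windows-specific bytes. The sketch below illustrates that idea with the mapping recovered from the constants in this module. ``remap_iso_to_windows`` is a hypothetical helper, and the ``0x80``-``0x9F`` byte range is an assumption, because the original pattern is not legible here.

```python
import re

# ISO-8859 names paired with the Windows code pages listed in this module.
ISO_WIN_MAP = {
    "iso-8859-1": "Windows-1252",
    "iso-8859-2": "Windows-1250",
    "iso-8859-5": "Windows-1251",
    "iso-8859-6": "Windows-1256",
    "iso-8859-7": "Windows-1253",
    "iso-8859-8": "Windows-1255",
    "iso-8859-9": "Windows-1254",
    "iso-8859-13": "Windows-1257",
}

# Assumption: bytes 0x80-0x9F are treated as evidence of a Windows code page.
WIN_BYTE_DETECTOR = re.compile(b"[\x80-\x9f]")


def remap_iso_to_windows(charset_name, data):
    """Swap an ISO-8859 guess for its Windows counterpart when warranted.

    Hypothetical helper for illustration; not part of chardet.
    """
    lower_name = charset_name.lower()
    if lower_name.startswith("iso-8859") and WIN_BYTE_DETECTOR.search(data):
        # Fall back to the original name when no counterpart is listed.
        return ISO_WIN_MAP.get(lower_name, charset_name)
    return charset_name
```

Names without an entry in the mapping, and input without any bytes in the assumed range, pass through unchanged.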
Tzno data received!asciig ?r$ )r r r Ng ziso-8859r z no probers hit minimum thresholdz%s %s confidence = %s)r r r r debugr r r" r2 r
r8 MINIMUM_THRESHOLDr7 lowerr) r ISO_WIN_MAPgetr getEffectiveLevelr DEBUGr' r probers) r prober_confidencemax_prober_confidence
max_proberr# r7 lower_charset_namer group_proberr r r close s`
zUniversalDetector.closeN)r
__module____qualname____doc__r? recompiler0 r3 r; rA r ALLr r r5 rK r r r r r 3 s"
mr )rN r* r rO charsetgroupproberr enumsr r r escproberr latin1proberr mbcsgroupproberr sbcsgroupproberr
objectr r r r r