Hubbub
detect.c File Reference
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <parserutils/charset/mibenum.h>
#include <hubbub/types.h>
#include "utils/utils.h"
#include "detect.h"


Macros

#define PEEK(a)
 
#define ADVANCE(a)
 
#define ISSPACE(a)
 

Functions

static uint16_t hubbub_charset_read_bom (const uint8_t *data, size_t len)
 Inspect the beginning of a buffer of data for the presence of a UTF Byte Order Mark.
 
static uint16_t hubbub_charset_scan_meta (const uint8_t *data, size_t len)
 Search for a meta charset within a buffer of data.
 
static uint16_t hubbub_charset_parse_attributes (const uint8_t **pos, const uint8_t *end)
 Parse attributes on a meta tag.
 
static bool hubbub_charset_get_attribute (const uint8_t **data, const uint8_t *end, const uint8_t **name, uint32_t *namelen, const uint8_t **value, uint32_t *valuelen)
 Extract an attribute from the data stream.
 
parserutils_error hubbub_charset_extract (const uint8_t *data, size_t len, uint16_t *mibenum, uint32_t *source)
 Extract a charset from a chunk of data.
 
uint16_t hubbub_charset_parse_content (const uint8_t *value, uint32_t valuelen)
 Parse a content= attribute's value.
 
void hubbub_charset_fix_charset (uint16_t *charset)
 Fix charsets, according to the override table in HTML5, section 8.2.2.2.
 

Macro Definition Documentation

#define ADVANCE (   a)
Value:
while (pos < end - SLEN(a)) { \
	if (PEEK(a)) \
		break; \
	pos++; \
} \
\
if (pos == end - SLEN(a)) \
	return 0;

Definition at line 188 of file detect.c.

Referenced by hubbub_charset_scan_meta().
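
The scanning behaviour of ADVANCE can be illustrated with a function-shaped sketch. This is not code from detect.c: the names `equals_nocase` and `advance_to` are hypothetical, the token length is passed explicitly instead of using SLEN, and the function returns NULL where the macro would execute `return 0` in its caller. As in the macro's condition `pos < end - SLEN(a)`, more than a token's worth of bytes must remain for a match to be attempted, so a token that ends exactly at `end` is not found.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: ASCII case-insensitive comparison of n bytes. */
static int equals_nocase(const uint8_t *p, const char *token, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (tolower(p[i]) != tolower((unsigned char) token[i]))
			return 0;
	}
	return 1;
}

/*
 * Hypothetical function form of ADVANCE: step forward until `token`
 * starts at the returned position. Returns NULL when no match is found.
 */
static const uint8_t *advance_to(const uint8_t *pos, const uint8_t *end,
		const char *token, size_t token_len)
{
	assert(pos != NULL && end != NULL && pos <= end);

	/* Same test as `pos < end - SLEN(a)`, written without pointer
	 * subtraction below the start of the buffer. */
	while ((size_t) (end - pos) > token_len) {
		if (equals_nocase(pos, token, token_len))
			return pos;
		pos++;
	}

	return NULL;
}
```

One deliberate difference: if fewer than `token_len` bytes remain on entry, this sketch returns NULL, whereas the macro as written would neither loop nor return.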

#define ISSPACE (   a)
Value:
(a == 0x09 || a == 0x0a || a == 0x0c || \
a == 0x0d || a == 0x20 || a == 0x2f)

Definition at line 198 of file detect.c.

Referenced by hubbub_charset_get_attribute(), hubbub_charset_parse_attributes(), hubbub_charset_parse_content(), and hubbub_charset_scan_meta().
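
For readers who want to see which bytes ISSPACE accepts, the sketch below restates the macro as a function. The names `is_meta_space` and `skip_meta_space` are hypothetical and do not appear in detect.c. Note that 0x2f ('/') is treated like whitespace, and that the macro expands its argument several times without parentheses, so it should only be given a simple expression with no side effects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function equivalent of ISSPACE. */
static bool is_meta_space(uint8_t c)
{
	return c == 0x09 ||	/* tab */
	       c == 0x0a ||	/* line feed */
	       c == 0x0c ||	/* form feed */
	       c == 0x0d ||	/* carriage return */
	       c == 0x20 ||	/* space */
	       c == 0x2f;	/* '/' */
}

/* Hypothetical usage: skip a run of such bytes without passing `end`. */
static const uint8_t *skip_meta_space(const uint8_t *pos, const uint8_t *end)
{
	assert(pos != NULL && pos <= end);

	while (pos < end && is_meta_space(*pos))
		pos++;

	return pos;
}
```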

#define PEEK (   a)
Value:
(pos < end - SLEN(a) && \
strncasecmp((const char *) pos, a, SLEN(a)) == 0)

Definition at line 184 of file detect.c.

Referenced by hubbub_charset_scan_meta().

Function Documentation

parserutils_error hubbub_charset_extract ( const uint8_t *  data,
size_t  len,
uint16_t *  mibenum,
uint32_t *  source 
)

Extract a charset from a chunk of data.

Parameters
data     Pointer to buffer containing data
len      Buffer length
mibenum  Pointer to location containing current MIB enum
source   Pointer to location containing current charset source
Returns
PARSERUTILS_OK on success, appropriate error otherwise

The values pointed to by mibenum and source will be updated on exit.

The larger a chunk of data fed to this routine, the better, as it allows charset autodetection access to a larger dataset for analysis.

Meaning of *source on entry:

CONFIDENT - Do not pass Go, do not attempt auto-detection.
TENTATIVE - We've tried to autodetect already, but subsequently discovered that we don't actually support the detected charset. Thus, we've defaulted to Windows-1252. Don't perform auto-detection again, as it would be futile. (This bit diverges from the spec)
UNKNOWN - No autodetection performed yet. Get on with it.

Todo:
We probably want to wait for ~512 bytes of data / 500ms here
Todo:
Charset autodetection

Definition at line 43 of file detect.c.

References HUBBUB_CHARSET_CONFIDENT, hubbub_charset_fix_charset(), hubbub_charset_read_bom(), hubbub_charset_scan_meta(), HUBBUB_CHARSET_TENTATIVE, and SLEN.

Referenced by hubbub_parser_create().
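
The entry conditions described above amount to a simple gate on *source. The sketch below models only that gate. The enum `sketch_source` and the function `sketch_should_autodetect` are stand-ins invented for illustration; the real constants (HUBBUB_CHARSET_CONFIDENT, HUBBUB_CHARSET_TENTATIVE and so on) are defined in hubbub/types.h and their numeric values are not assumed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in values for illustration only. */
enum sketch_source {
	SKETCH_UNKNOWN,
	SKETCH_TENTATIVE,
	SKETCH_CONFIDENT
};

/* Decide whether auto-detection should run for the given source. */
static bool sketch_should_autodetect(const uint32_t *source)
{
	assert(source != NULL);

	switch (*source) {
	case SKETCH_UNKNOWN:
		/* Nothing has been tried yet. */
		return true;
	case SKETCH_TENTATIVE:
		/* Already tried; the fallback is in place. */
		return false;
	case SKETCH_CONFIDENT:
		/* The caller already knows the charset. */
		return false;
	default:
		return false;
	}
}
```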

void hubbub_charset_fix_charset ( uint16_t *  charset)

Fix charsets, according to the override table in HTML5, section 8.2.2.2.

Character encoding requirements http://www.whatwg.org/specs/web-apps/current-work/#character0

Parameters
charsetPointer to charset value to fix

Definition at line 666 of file detect.c.

References SLEN.

Referenced by hubbub_charset_extract(), hubbub_parser_create(), and process_meta_in_head().
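
A table-driven override of this kind might look like the sketch below. It is not the function's actual body: `sketch_fix_charset` and its table are hypothetical, only three of the HTML5 overrides are shown, and the MIB enum numbers are hard-coded from the IANA character sets registry as an assumption (US-ASCII 3, ISO-8859-1 4, ISO-8859-9 12, windows-1252 2252, windows-1254 2254) rather than looked up by name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One override: replace MIB enum `from` with MIB enum `to`. */
struct sketch_override {
	uint16_t from;
	uint16_t to;
};

/* Assumed IANA MIB enum values; a partial table for illustration. */
static const struct sketch_override sketch_overrides[] = {
	{ 3, 2252 },	/* US-ASCII   -> windows-1252 */
	{ 4, 2252 },	/* ISO-8859-1 -> windows-1252 */
	{ 12, 2254 },	/* ISO-8859-9 -> windows-1254 */
};

/* Rewrite *charset in place if it has an entry in the table. */
static void sketch_fix_charset(uint16_t *charset)
{
	size_t n = sizeof(sketch_overrides) / sizeof(sketch_overrides[0]);

	assert(charset != NULL);

	for (size_t i = 0; i < n; i++) {
		if (*charset == sketch_overrides[i].from) {
			*charset = sketch_overrides[i].to;
			return;
		}
	}
}
```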

bool hubbub_charset_get_attribute ( const uint8_t **  data,
const uint8_t *  end,
const uint8_t **  name,
uint32_t *  namelen,
const uint8_t **  value,
uint32_t *  valuelen 
)
static

Extract an attribute from the data stream.

Parameters
data      Pointer to pointer to current location (updated on exit)
end       Pointer to end of data stream
name      Pointer to location to receive attribute name
namelen   Pointer to location to receive attribute name length
value     Pointer to location to receive attribute value
valuelen  Pointer to location to receive attribute value length
Returns
true if attribute extracted, false otherwise.

Note: The caller should heed the returned lengths; these are the only indicator that useful content resides in name or value.

Definition at line 486 of file detect.c.

References ISSPACE.

Referenced by hubbub_charset_parse_attributes(), and hubbub_charset_scan_meta().
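
To make the calling convention concrete, here is a simplified sketch with the same parameter shape. It is not the implementation in detect.c: `sketch_get_attribute` and `sketch_is_space` are hypothetical, and the parsing rules (name up to '=', whitespace or '>'; optional quoted or unquoted value) are an assumption modelled loosely on the HTML5 prescan. It does show why the lengths matter: name and value point into the caller's buffer and are not NUL-terminated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same byte set as ISSPACE. */
static bool sketch_is_space(uint8_t c)
{
	return c == 0x09 || c == 0x0a || c == 0x0c ||
	       c == 0x0d || c == 0x20 || c == 0x2f;
}

static bool sketch_get_attribute(const uint8_t **data, const uint8_t *end,
		const uint8_t **name, uint32_t *namelen,
		const uint8_t **value, uint32_t *valuelen)
{
	const uint8_t *p;

	assert(data != NULL && *data != NULL && end != NULL);
	assert(name != NULL && namelen != NULL);
	assert(value != NULL && valuelen != NULL);

	p = *data;
	*name = *value = p;
	*namelen = *valuelen = 0;

	/* Skip leading whitespace; stop at end of input or end of tag. */
	while (p < end && sketch_is_space(*p))
		p++;
	if (p == end || *p == '>') {
		*data = p;
		return false;
	}

	/* Attribute name. */
	*name = p;
	while (p < end && *p != '=' && *p != '>' && !sketch_is_space(*p))
		p++;
	*namelen = (uint32_t) (p - *name);

	/* Optional "= value". */
	while (p < end && sketch_is_space(*p))
		p++;
	if (p < end && *p == '=') {
		p++;
		while (p < end && sketch_is_space(*p))
			p++;
		if (p < end && (*p == '"' || *p == '\'')) {
			uint8_t quote = *p++;
			*value = p;
			while (p < end && *p != quote)
				p++;
			*valuelen = (uint32_t) (p - *value);
			if (p < end)
				p++;	/* step past the closing quote */
		} else {
			*value = p;
			while (p < end && *p != '>' && !sketch_is_space(*p))
				p++;
			*valuelen = (uint32_t) (p - *value);
		}
	}

	*data = p;
	return true;
}
```

A caller would typically loop while the function returns true, comparing name against the attribute it is interested in using namelen rather than relying on a terminator.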

uint16_t hubbub_charset_parse_attributes ( const uint8_t **  pos,
const uint8_t *  end 
)
static

Parse attributes on a meta tag.

Parameters
pos  Pointer to pointer to current location (updated on exit)
end  Pointer to end of data stream
Returns
MIB enum of detected encoding, or 0 if none found

Definition at line 299 of file detect.c.

References hubbub_charset_get_attribute(), hubbub_charset_parse_content(), ISSPACE, name, and SLEN.

Referenced by hubbub_charset_scan_meta().

uint16_t hubbub_charset_parse_content ( const uint8_t *  value,
uint32_t  valuelen 
)

Parse a content= attribute's value.

Parameters
value     Attribute's value
valuelen  Length of value
Returns
MIB enum of detected encoding, or 0 if none found

Definition at line 368 of file detect.c.

References ISSPACE, and SLEN.

Referenced by hubbub_charset_parse_attributes(), and process_meta_in_head().
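
As an illustration of what parsing a content= value involves, the sketch below locates the charset name inside a string such as "text/html; charset=UTF-8". It is a simplified assumption about the approach, not the code in detect.c: `sketch_find_charset` and `sketch_space` are hypothetical, and the sketch stops at returning the name's position and length, whereas the documented function goes on to return a MIB enum.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same byte set as ISSPACE. */
static bool sketch_space(uint8_t c)
{
	return c == 0x09 || c == 0x0a || c == 0x0c ||
	       c == 0x0d || c == 0x20 || c == 0x2f;
}

/* Find the charset name in a content= value; false if there is none. */
static bool sketch_find_charset(const uint8_t *value, uint32_t valuelen,
		const uint8_t **name, uint32_t *namelen)
{
	static const char key[] = "charset";
	const size_t keylen = sizeof(key) - 1;
	const uint8_t *p, *end, *start;

	assert(value != NULL && name != NULL && namelen != NULL);

	p = value;
	end = value + valuelen;

	/* Locate "charset", ignoring ASCII case. */
	for (; (size_t) (end - p) >= keylen; p++) {
		size_t i = 0;
		while (i < keylen && tolower(p[i]) == key[i])
			i++;
		if (i == keylen)
			break;
	}
	if ((size_t) (end - p) < keylen)
		return false;
	p += keylen;

	/* Expect optional whitespace, '=', optional whitespace. */
	while (p < end && sketch_space(*p))
		p++;
	if (p == end || *p != '=')
		return false;
	p++;
	while (p < end && sketch_space(*p))
		p++;
	if (p == end)
		return false;

	if (*p == '"' || *p == '\'') {
		/* Quoted name: must have a matching closing quote. */
		uint8_t quote = *p++;
		start = p;
		while (p < end && *p != quote)
			p++;
		if (p == end)
			return false;
	} else {
		/* Unquoted name: runs to whitespace or ';'. */
		start = p;
		while (p < end && !sketch_space(*p) && *p != ';')
			p++;
	}

	*name = start;
	*namelen = (uint32_t) (p - start);
	return *namelen > 0;
}
```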

uint16_t hubbub_charset_read_bom ( const uint8_t *  data,
size_t  len 
)
static

Inspect the beginning of a buffer of data for the presence of a UTF Byte Order Mark.

Parameters
data  Pointer to buffer containing data
len   Buffer length
Returns
MIB enum representing encoding described by BOM, or 0 if not found

Definition at line 161 of file detect.c.

References SLEN.

Referenced by hubbub_charset_extract().
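
A BOM check of this kind can be sketched as follows. This page does not state which BOMs the real function recognises or how it obtains MIB enums, so the sketch makes its own assumptions: `sketch_read_bom` is a hypothetical name, only the UTF-8 and UTF-16 marks are tested, and the MIB enum numbers are hard-coded from the IANA character sets registry (UTF-8 106, UTF-16BE 1013, UTF-16LE 1014).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed IANA MIB enum values. */
#define SKETCH_MIB_UTF8		106
#define SKETCH_MIB_UTF16BE	1013
#define SKETCH_MIB_UTF16LE	1014

/* Return the MIB enum implied by a leading BOM, or 0 if there is none. */
static uint16_t sketch_read_bom(const uint8_t *data, size_t len)
{
	assert(data != NULL || len == 0);

	/* EF BB BF: UTF-8 */
	if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB &&
			data[2] == 0xBF)
		return SKETCH_MIB_UTF8;

	/* FE FF: UTF-16, big endian */
	if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
		return SKETCH_MIB_UTF16BE;

	/* FF FE: UTF-16, little endian */
	if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
		return SKETCH_MIB_UTF16LE;

	return 0;
}
```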

uint16_t hubbub_charset_scan_meta ( const uint8_t *  data,
size_t  len 
)
static

Search for a meta charset within a buffer of data.

Parameters
data  Pointer to buffer containing data
len   Length of buffer
Returns
MIB enum representing encoding, or 0 if none found

Definition at line 209 of file detect.c.

References ADVANCE, hubbub_charset_get_attribute(), hubbub_charset_parse_attributes(), ISSPACE, min, PEEK, and SLEN.

Referenced by hubbub_charset_extract().