Nvtext Normalize
- group Normalizing
Functions
-
std::unique_ptr<cudf::column> normalize_spaces(cudf::strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns a new strings column by normalizing the whitespace in each string in the input column.
Normalizing a string replaces each run of one or more whitespace characters (character code-point <= ' ') with a single space ' ' and trims whitespace from the beginning and end of the string.
Example:
s = ["a b", " c d\n", "e \t f "]
t = normalize_spaces(s)
t is now ["a b","c d","e f"]
A null input element at row i produces a corresponding null entry for row i in the output column.
- Parameters:
input – Strings column to normalize
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column of normalized strings.
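The whitespace rule above can be illustrated with a small host-side reference. This is only a sketch of the documented semantics on `std::string` values, not libcudf's GPU implementation; the names `normalize_spaces_host` and `normalize_spaces_column_host` are hypothetical, and `std::optional` is used here as a stand-in for null rows.

```cpp
#include <optional>
#include <string>
#include <vector>

// Every run of characters with code point <= ' ' becomes a single space,
// and runs at the beginning or end of the string are dropped.
std::string normalize_spaces_host(std::string const& input)
{
  std::string output;
  bool pending_space = false;  // a whitespace run follows some emitted text
  for (unsigned char ch : input) {
    if (ch <= ' ') {
      pending_space = !output.empty();  // leading runs are never emitted
    } else {
      if (pending_space) { output.push_back(' '); }
      output.push_back(static_cast<char>(ch));
      pending_space = false;
    }
  }
  return output;  // a trailing run is left pending and therefore trimmed
}

// Null rows (empty optionals) stay null in the same position.
std::vector<std::optional<std::string>> normalize_spaces_column_host(
  std::vector<std::optional<std::string>> const& rows)
{
  std::vector<std::optional<std::string>> result;
  result.reserve(rows.size());
  for (auto const& row : rows) {
    if (row.has_value()) {
      result.emplace_back(normalize_spaces_host(*row));
    } else {
      result.emplace_back(std::nullopt);
    }
  }
  return result;
}
```

Applied to the example strings, `normalize_spaces_host(" c d\n")` yields `"c d"` and `normalize_spaces_host("e \t f ")` yields `"e f"`.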
-
std::unique_ptr<cudf::column> normalize_characters(cudf::strings_column_view const &input, bool do_lower_case, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Normalizes strings characters for tokenizing.
This uses the normalizer that is built into the nvtext::subword_tokenize function which includes:
adding padding around punctuation (unicode category starts with “P”) as well as certain ASCII symbols like “^” and “$”
adding padding around the CJK Unicode block characters
changing whitespace (e.g. "\t", "\n", "\r") to just space " "
removing control characters (unicode categories “Cc” and “Cf”)
The padding process here adds a single space before and after the character. Details on unicode category can be found here: https://unicodebook.readthedocs.io/unicode.html#categories
If do_lower_case = true, lower-casing also removes the accents. The accents cannot be removed from upper-case characters without lower-casing and lower-casing cannot be performed without also removing accents. However, if the accented character is already lower-case, then only the accent is removed.
s = ["éâîô\teaio", "ĂĆĖÑÜ", "ACENU", "$24.08", "[a,bb]"]
s1 = normalize_characters(s,true)
s1 is now ["eaio eaio", "acenu", "acenu", " $ 24 . 08", " [ a , bb ] "]
s2 = normalize_characters(s,false)
s2 is now ["éâîô eaio", "ĂĆĖÑÜ", "ACENU", " $ 24 . 08", " [ a , bb ] "]
A null input element at row i produces a corresponding null entry for row i in the output column.
This function requires about 16x the number of character bytes in the input strings column as working memory.
- Parameters:
input – The input strings to normalize
do_lower_case – If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed.
stream – CUDA stream used for device memory operations and kernel launches
mr – Memory resource to allocate any returned objects
- Returns:
Normalized strings column
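The padding, whitespace, and control-character steps listed above can be sketched on the host for ASCII input. This is an illustration under stated assumptions, not the library's implementation: `std::ispunct` in the default "C" locale stands in for "Unicode category P plus symbols like ^ and $", CJK padding and accent removal are left out, and the name `normalize_ascii_host` is hypothetical.

```cpp
#include <cctype>
#include <string>

// ASCII-only sketch of the documented normalization steps.
// Assumption: std::ispunct ("C" locale) approximates the punctuation set.
std::string normalize_ascii_host(std::string const& input, bool do_lower_case)
{
  std::string output;
  for (unsigned char ch : input) {
    if (ch == '\t' || ch == '\n' || ch == '\r') {
      output.push_back(' ');  // whitespace becomes a plain space
    } else if (std::iscntrl(ch)) {
      continue;  // remaining control characters are removed
    } else if (std::ispunct(ch)) {
      output.push_back(' ');  // single space before the character
      output.push_back(static_cast<char>(ch));
      output.push_back(' ');  // single space after the character
    } else if (do_lower_case) {
      output.push_back(static_cast<char>(std::tolower(ch)));
    } else {
      output.push_back(static_cast<char>(ch));
    }
  }
  return output;
}
```

With this sketch, `normalize_ascii_host("$24.08", false)` returns `" $ 24 . 08"` and `normalize_ascii_host("[a,bb]", false)` returns `" [ a , bb ] "`, matching the ASCII rows of the example.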
-
std::unique_ptr<character_normalizer> create_character_normalizer(bool do_lower_case, cudf::strings_column_view const &special_tokens = cudf::strings_column_view(cudf::column_view{cudf::data_type{cudf::type_id::STRING}, 0, nullptr, nullptr, 0}), rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Create a normalizer object.
Creates a normalizer object which can be reused on multiple calls to nvtext::normalize_characters.
See also
nvtext::character_normalizer
- Parameters:
do_lower_case – If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed.
special_tokens – Individual tokens including [] brackets. Default is no special tokens.
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
Object to be used with nvtext::normalize_characters
-
std::unique_ptr<cudf::column> normalize_characters(cudf::strings_column_view const &input, character_normalizer const &normalizer, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Normalizes the text in the input strings column.
cn = create_character_normalizer(true)
s = ["éâîô\teaio", "ĂĆĖÑÜ", "ACENU", "$24.08", "[a,bb]"]
s1 = normalize_characters(s,cn)
s1 is now ["eaio eaio", "acenu", "acenu", " $ 24 . 08", " [ a , bb ] "]
cn = create_character_normalizer(false)
s2 = normalize_characters(s,cn)
s2 is now ["éâîô eaio", "ĂĆĖÑÜ", "ACENU", " $ 24 . 08", " [ a , bb ] "]
See also
nvtext::character_normalizer for details on the normalizer behavior
A null input element at row i produces a corresponding null entry for row i in the output column.
- Parameters:
input – The input strings to normalize
normalizer – Normalizer to use for this function
stream – CUDA stream used for device memory operations and kernel launches
mr – Memory resource to allocate any returned objects
- Returns:
Normalized strings column
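The do_lower_case behavior in the example above (accents are stripped together with lower-casing, and nothing changes when the flag is false) can be sketched with a tiny lookup table over UTF-32 code points. The table below is hypothetical and covers only the accented characters used in the example; the real normalizer relies on its own Unicode tables, and the function names here are illustrative.

```cpp
#include <string>
#include <unordered_map>

// Maps one code point to its lower-case, accent-free form.
// Assumption: only ASCII letters and the example's accented letters are handled.
char32_t lower_and_strip_accent(char32_t cp)
{
  static std::unordered_map<char32_t, char32_t> const table{
    {U'\u00E9', U'e'},  // é
    {U'\u00E2', U'a'},  // â
    {U'\u00EE', U'i'},  // î
    {U'\u00F4', U'o'},  // ô
    {U'\u0102', U'a'},  // Ă
    {U'\u0106', U'c'},  // Ć
    {U'\u0116', U'e'},  // Ė
    {U'\u00D1', U'n'},  // Ñ
    {U'\u00DC', U'u'},  // Ü
  };
  if (cp >= U'A' && cp <= U'Z') { return static_cast<char32_t>(cp - U'A' + U'a'); }
  auto const found = table.find(cp);
  return found == table.end() ? cp : found->second;
}

// With do_lower_case == false the text is returned untouched, so accented and
// upper-case characters are preserved, as in the s2 row of the example.
std::u32string apply_lower_case_host(std::u32string const& input, bool do_lower_case)
{
  if (!do_lower_case) { return input; }
  std::u32string output;
  output.reserve(input.size());
  for (char32_t cp : input) { output.push_back(lower_and_strip_accent(cp)); }
  return output;
}
```

An already lower-case accented character such as é loses only its accent, while Ă is both lower-cased and stripped in one step.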
-
struct character_normalizer
- #include <normalize.hpp>
Normalizer object to be used with nvtext::normalize_characters.
Use nvtext::create_character_normalizer to create this object.
This normalizer includes:
adding padding around punctuation (unicode category starts with “P”) as well as certain ASCII symbols like “^” and “$”
adding padding around the CJK Unicode block characters
changing whitespace (e.g. "\t", "\n", "\r") to just space " "
removing control characters (unicode categories “Cc” and “Cf”)
The padding process adds a single space before and after the character. Details on unicode category can be found here: https://unicodebook.readthedocs.io/unicode.html#categories
If do_lower_case = true, lower-casing also removes any accents. The accents cannot be removed from upper-case characters without lower-casing and lower-casing cannot be performed without also removing accents. However, if the accented character is already lower-case, then only the accent is removed.
If special_tokens are included, the padding after [ and before ] is not inserted if the characters between them match one of the given tokens. Also, the special_tokens are expected to include the [] characters at the beginning and end of each string appropriately.
Public Functions
-
character_normalizer(bool do_lower_case, cudf::strings_column_view const &special_tokens, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Normalizer object constructor.
This initializes and holds the character normalizing tables and settings.
- Parameters:
do_lower_case – If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed.
special_tokens – Each row is a token including the [] brackets. For example: [BOS], [EOS], [UNK], [SEP], [PAD], [CLS], [MASK]
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
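The special-token rule described for this normalizer can be sketched on the host for ASCII input. This is a rough illustration, assuming that a token is recognized when the text from a `[` through the next `]` exactly equals one entry of the token list; the library's actual matching and its interaction with lower-casing may differ. The name `pad_with_special_tokens` is hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Pads punctuation with single spaces, but keeps a bracketed special token
// intact: no space is inserted after '[' or before ']' when the bracketed
// text matches one of the given tokens (assumed exact, case-sensitive match).
std::string pad_with_special_tokens(std::string const& input,
                                    std::vector<std::string> const& special_tokens)
{
  std::string output;
  std::size_t pos = 0;
  while (pos < input.size()) {
    unsigned char const ch = static_cast<unsigned char>(input[pos]);
    if (ch == '[') {
      auto const close = input.find(']', pos);
      if (close != std::string::npos) {
        auto const candidate = input.substr(pos, close - pos + 1);
        auto const match = std::find(special_tokens.begin(), special_tokens.end(), candidate);
        if (match != special_tokens.end()) {
          output += ' ' + candidate + ' ';  // pad the whole token as one unit
          pos = close + 1;
          continue;
        }
      }
    }
    if (std::ispunct(ch)) {
      output.push_back(' ');
      output.push_back(static_cast<char>(ch));
      output.push_back(' ');
    } else {
      output.push_back(static_cast<char>(ch));
    }
    ++pos;
  }
  return output;
}
```

With tokens `[BOS]` and `[EOS]`, the input `"[BOS]hi[x]"` becomes `" [BOS] hi [ x ] "`: the registered token stays whole, while the unregistered `[x]` is padded bracket by bracket.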