Convert String to Unicode Characters in Python
Strings are the Python's sequence data structure that is used to represent text. Unicode is a type of character encoding technique that covers all characters with human languages and various symbols. The text can be handled in a uniform way and represented in a standardized way. There are many built-in Python modules and packages that support Unicode strings and create Unicode strings. The string can be converted into Unicode characters in Python in many ways, such as re.sub(), ord(), lambda, join() function and the format() functions.
Here is an example of Python string Unicode:
Code:
# Quick examples of converting string to Unicode characters
import re
# Initializing string
string = "This is a string"
# Example 1: Using re.sub() and ord() method
def unicode_replacement(match):
char = match.group()
return r'\u{:04X}'.format(ord(char))
result = re.sub(r'.', unicode_replacement, string)
print(result)
# Example 2: Convert string to Unicode
# Using re.sub() and ord() method
result = (re.sub('.', lambda x: r'\u % 04X' % ord(x.group()), string))
print(result)
# Example 3: Convert string to Unicode
# Using join(), format(), and ord() method
string = "Python"
unicode_result = ' '.join([r'\u{:04X}'.format(ord(ele)) for ele in string])
print(unicode_result)
Output:
\u0054\u0068\u0069\u0073\u0020\u0069\u0073\u0020\u0061\u0020\u0073\u0074\u0072\u0069\u006E\u0067
\u 054\u 068\u 069\u 073\u 020\u 069\u 073\u 020\u 061\u 020\u 073\u 074\u 072\u 069\u 06E\u 067
\u0050 \u0079 \u0074 \u0068 \u006F \u006E
Explanation:
In the above code, there are three examples of converting a string to a Unicode string. The re-module is imported, and a string is taken as input. In the first example, a Unicode_replacement function is made that takes the parameter as a match, and in the function, the character is grouped, and the Unicode string is returned. In the second example, the lambda function is used to convert the string into a Unicode string. In the third example, the string is Python, and this string is joined with the help of format. First, the string is converted into binary and then converted into the Unicode string.
Convert String to Unicode Using re.sub() and ord() Method:
The re.sub() and ord() functions can be used to convert the string to a Unicode string. Consider an example where a re-module can be imported to work with a regular expression and define the string that contains the characters you want to convert to Unicode. The string can then be converted with the help of the function Unicode_replacement, which takes a parameter as a match object. The Unicode_replacement function extracts the characters that are matched with the help of match.group() method, and convert it to a Unicode code point with the help of the ord() function and formats the ‘{:04X}’.format(ord(char)) is used to convert format the Unicode point into a four-digit hexadecimal representation with leading zeros. The function re.sub() is used to replace each character in the with the Unicode representation. Each character is matched with the help of regular expression. The function Unicode_replacement replace the re.sub(). Finally, the Unicode string is printed.
Code:
import re
# Initializing string
string = "This is the string"
print("Original string:", string)
# Using re.sub() and ord() method
def unicode_replacement(match):
char = match.group()
return r'\u{:04X}'.format(ord(char))
result = re.sub(r'.', unicode_replacement, string)
print("After converting the string to unicode:\n", result)
Output:
Original string: This is the string
After converting the string to Unicode:
\u0054\u0068\u0069\u0073\u0020\u0069\u0073\u0020\u0074\u0068\u0065\u0020\u0073\u0074\u0072\u0069\u006E\u0067
Explanation:
In the above code, each character of the string is matched, and the character is converted to a four-digit hexadecimal representation, and then this hexadecimal representation is converted to a Unicode string.
Code:
import re
# Initializing string
string = "This is the string"
print("Original string:", string)
# Convert string to Unicode
# Using re.sub() and ord() method
result = (re.sub('.', lambda x: r'\u % 04X' % ord(x.group()), string))
print("After converting the string to unicode:\n", result)
Output:
Original string: This is the string
After converting the string to Unicode:
\u 054\u 068\u 069\u 073\u 020\u 069\u 073\u 020\u 074\u 068\u 065\u 020\u 073\u 074\u 072\u 069\u 06E\u 067
Explanation:
In the above code, each character of the string is converted into its Unicode representation with the help of the re.sub() method and the ord() function. Then, each character is matched with the help of regular expression, and the re.sub() function lambda function converts each character to the Unicode point with the help %04X format specified for four-digit hexadecimal representation.
Convert String to Unicode Using join(), format(), and ord()
The join () method can also be used with the format() method and ord() function to convert a string to its Unicode representation.
Code:
# Initializing string
string = "Python"
print("Original string:", string)
# Convert string to Unicode
# Using join(), format(), and ord() method
unicode_result = ' '.join([r'\u{:04X}'.format(ord(ele)) for ele in string])
print("After converting the string to Unicode:", unicode_result)
Output:
Original string: Python
After converting the string to Unicode: \u0050 \u0079 \u0074 \u0068 \u006F \u006E
Explanation:
In the above code, the list comprehension is used to iterate over each character of the string. For every character of the string, the ord() function is used to get the Unicode code point, and then the format() method is used to create the Unicode escape sequence with the help of the hexadecimal value of the code point. The resulting Unicode escape sequences are joined using the join() method; the space is used to separate the Unicode code point.
Conclusion:
For converting a string to its Unicode characters, there are many Python methods, such as re.sub(), ord(), lambda, join() and format() functions that can be used to convert the string to its Unicode representation.