HGU_CSEE
Basics of Data
1. How to represent numbers in computers
As you know, computer language consists of only 0s and 1s. How, then, does a computer that can recognize only 0 and 1 understand vast numbers? What is certain is that just because a computer understands only 0 and 1 does not mean it can express only 0 and 1. Computers combine many 0s and 1s to express many numbers.
The smallest data unit, that is, the smallest unit a computer can understand, is called a "bit": 0 <zero> or 1 <one>.
Therefore, one bit can express two values of data. <0 and 1>
Each additional bit doubles the number of values that can be expressed, so n bits can represent 2^n values. As you can see, the unit 'bit' is too small to use in our daily life. <No one says "My PDF file size is 8,123,285 bits."> A group of 8 bits is called a byte.
For that reason, there are larger units of data size such as the Kilobyte <equal to 1,000 bytes>, Megabyte <equal to 1,000 Kilobytes>, Gigabyte, Terabyte, and so on.
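The growth described above can be checked directly. This is a minimal sketch in Python showing how the number of representable values doubles with each bit, and how bits relate to bytes:

```python
# With n bits you can distinguish 2**n different bit patterns.
for n in (1, 2, 4, 8):
    print(f"{n} bit(s) -> {2**n} values")

# 8 bits = 1 byte, so larger units are built from bytes.
KB = 1000          # kilobyte, in bytes
MB = 1000 * KB     # megabyte, in bytes
print(f"8,123,285 bits is about {8123285 // 8:,} bytes")
```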
WORD: It refers to the size of data that the CPU can process at once. This is a relatively special unit because it depends on the CPU (e.g., 32-bit or 64-bit) rather than being a fixed value.
By using a positional number system <binary, decimal, or hexadecimal>, computers can understand and represent huge numbers.
"How to represent the negative binary numbers?"
→ Using two's complement: the two's complement of an n-bit number x is 2^n − x, where 2^n is greater than x.
In binary, it can be computed easily by inverting every bit and adding 1.
ex) 011 <3 in decimal> -> 100 //invert all bits -> +1 -> 101 <-3 in decimal>
010 <2 in decimal> -> 101 //invert all bits -> +1 -> 110 <-2 in decimal>
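The "invert all bits and add 1" rule above can be sketched as a small Python function. The bit width `n_bits` is an assumption the caller supplies, since two's complement is only meaningful for a fixed width:

```python
def twos_complement(x: int, n_bits: int) -> int:
    """Return the n-bit two's-complement pattern of -x, as an unsigned int."""
    mask = (1 << n_bits) - 1       # e.g. 0b111 for 3 bits
    return ((~x) + 1) & mask       # invert every bit, add 1, keep n bits

print(format(twos_complement(0b011, 3), "03b"))  # 101, i.e. -3
print(format(twos_complement(0b010, 3), "03b"))  # 110, i.e. -2
```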
As you can see, it is difficult to determine whether a given binary pattern is positive or negative.
For example, if I want to represent -11 in binary <that is, -3 in decimal> through two's complement, I can express it by
$$101_{(2)}$$ This is the same bit pattern as the positive number '5' in decimal. It confuses us obviously. Therefore, in the computer, the "flag" is used for determining whether the number is positive or negative. More details about flags will be covered in the chapter "Register". For now, it is only necessary to think that the numbers inside the computer have a flag to distinguish whether they are negative or positive.
※Two's complement's limit
Two's complement is a useful method for representing negative numbers, but it is not a perfect one.
Let's think about 1000 <binary>. If you take its two's complement, the result is the same as the original pattern.
1000 -> 0111 -> +1 -> 1000 <two's complement>
In other words, with n bits the two's-complement method covers the range -2^(n-1) to 2^(n-1)-1, so it cannot represent both +2^(n-1) and -2^(n-1) at the same time. With 4 bits we can store -8 <1000>, but representing +8 requires 5 bits.
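The edge case above can be demonstrated directly. This is a minimal sketch assuming a 4-bit width: negating the most negative value (-8 = 1000) just wraps back to itself.

```python
N = 4
mask = (1 << N) - 1                 # 0b1111: keep only 4 bits

x = 0b1000                          # -8 in 4-bit two's complement
neg = ((~x) + 1) & mask             # invert all bits and add 1
print(format(neg, "04b"))           # 1000 -- the "negation" is unchanged
```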
<The other methods of expressing numbers are simple enough that they are omitted here.>
2. How to represent characters in computers
In order to represent characters with only 0 and 1, you need to understand 'Character set', 'Encoding', and 'Decoding'.
- Character set
It refers to the set of characters that a computer can understand and output. For example, suppose a computer's character set is <a, b, c, d, e>. That means the computer cannot understand 'f' or any other character. Then how can the computer understand the characters it does know? -> Encoding
- Encoding
Even if a character belongs to the character set, the computer cannot understand the character itself. It is similar to numbers: just as computers convert decimal numbers into binary, the characters humans understand must be converted into 0s and 1s. This process is called encoding.
- Decoding
On the other hand, humans cannot read characters converted to 0s and 1s. Therefore, the process of converting data consisting of 0s and 1s back into characters that humans can understand is called decoding. This is the opposite process of encoding.
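The encoding/decoding round trip above can be sketched with Python's built-in str/bytes conversion (using the UTF-8 encoding introduced later in this post):

```python
text = "abc"
encoded = text.encode("utf-8")      # encoding: characters -> bytes (0s and 1s)
print(encoded)                      # b'abc'
print([format(b, "08b") for b in encoded])  # the actual bit patterns

decoded = encoded.decode("utf-8")   # decoding: bytes -> characters again
print(decoded == text)              # True: decoding reverses encoding
```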
※ An early set of characters, the ASCII code https://en.wikipedia.org/wiki/ASCII
Each character in ASCII is represented by 7 bits, which means the total number of ASCII characters is 128.
However, that is not enough to express Korean. <The extended ASCII code can express up to 256 characters with 8 bits, but it is also not enough.>
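The ASCII mapping can be inspected directly: each character corresponds to a code point that fits in 7 bits (0–127).

```python
for ch in "A", "a", "0":
    code = ord(ch)                      # character -> ASCII code
    print(ch, code, format(code, "07b"))

print(chr(65))          # A  -- code -> character
print(2 ** 7)           # 128: all codes that fit in 7 bits
```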
2-1. Korean style encoding: EUC-KR
In Korean, one syllable is composed of a combination of up to three parts. ('강' = ㄱ + ㅏ + ㅇ)
Therefore, a unique encoding technique for Korean was needed. That is EUC-KR.
EUC-KR is based on the character sets 'KS X 1001' and 'KS X 1003', and it needs 2 bytes to represent one Korean character <four hexadecimal digits>. In that way, about 2,350 Korean syllables can be represented, but that is not enough to express all possible Korean syllables.
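Python ships an `euc-kr` codec, so the 2-byte claim above can be checked directly in a short sketch:

```python
encoded = "강".encode("euc-kr")
print(len(encoded))                # 2 bytes per Hangul syllable
print(encoded.hex())               # i.e. four hexadecimal digits

print(encoded.decode("euc-kr"))    # round-trips back to 강
```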
2-2. Unicode and UTF-8
Unicode is a huge character set. It contains most special characters and can express languages from around the world. Unicode has several encoding methods, such as UTF-8, UTF-16, and UTF-32; UTF-8 is the most popular.
https://ko.wikipedia.org/wiki/UTF-8
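UTF-8 is a variable-length encoding, which is why it handles both ASCII and Korean: ASCII characters take 1 byte each, while Hangul syllables take 3 bytes. A minimal sketch:

```python
for ch in "A", "강", "€":
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex())  # 1, 3, and 3 bytes respectively

# Every Unicode encoding of the same text decodes back to the same string.
print("강".encode("utf-32").decode("utf-32") == "강")   # True
```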