魔兽(12)divide and conquerAndConqu...

Topics Covered In This Chapter:
?Prime and Composite Numbers
?The Sieve of Eratosthenes
?The Rabin-Miller Primality Test
“Mathematicians have tried in vain to this day
to discover some order in the sequence of prime numbers, and we have reason to
believe that it is a mystery into which the human mind will never penetrate.”
Leonhard Euler, 18th
century mathematician
All of the ciphers described in this book so far have been
around for hundreds of years, and all of them (with the exception of the one-time
pad) are easily broken by computers. These ciphers worked very well when hackers
had to rely on pencil and paper to hack them, but computers can now manipulate
data trillions of times faster than a person with a pencil.
The RSA cipher has several improvements over these old
ciphers, and it will be detailed in the next chapter. However, the RSA cipher
will require us to learn about prime numbers first.
A prime number is an integer (that is, a whole number) that is
greater than 1 and has only two factors: 1 and itself. Remember that the
factors of a number are the numbers that can be multiplied to equal the
original number. The numbers 3 and 7 are factors of 21. The number 12 has
factors 2 and 6, but also 3 and 4.
Every number has factors of 1 and itself. The numbers 1 and
21 are factors of 21. The numbers 1 and 12 are factors of 12. This is because 1
times any number will always be that same number. But if no other factors exist
for that number, then that number is prime.
Here’s a list of prime numbers (note that 1 is not
considered a prime number):
2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137,
139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227,
229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281 …and so on, FOREVER.
There are an infinite number of prime numbers. This means
there is no “largest” prime. They just keep getting bigger and bigger, just
like regular numbers do. (See /infiniteprimes
for a proof of this.) The RSA cipher makes use of very large prime numbers in
the keys it uses. Because of this, there will be far too many keys to brute-force
A googol is the number that has a one with a
hundred zeros behind it:
10,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
(As a bit of history: the name of the search engine company
Google came from misspelling “googol”, but they decided they liked that
spelling better and kept it.)
billion billion googols has twenty-seven more zeros than a
10,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
But these are tiny numbers. A typical prime number used in
our RSA program will have hundreds of digits:
112,829,754,900,439,506,175,719,191,782,841,802,172,556,768,253,593,054,977,186,,780,304,652,423,405,148,425,447,063,090,165,759,070,742,102,132,335,103,295,947,000,718,386,333,756,395,799,633,478,227,612,244,071,875,721,006,813,307,628,061,280,861,610,153,485,352,017,238,548,269,452,852,733,818,231,045,171,038,838,387,845,888,589,411,762,622,041,204,120,706,150,518,465,720,862,068,595,814,264,819
The above number is so big, I’m going to guess you didn’t
even read it to notice the typo in it.
Integers that are not prime numbers are called composite
numbers, because they are composed of at least two factors besides 1 and the
number itself. They are called composite numbers because these number are composed
of prime numbers multiplied together, such as the composite number 1,386 being
composed of the prime numbers in 2 × 3 × 3 × 7 × 11.
Here are four facts about prime numbers:
numbers are integers greater than 1that have only 1 and themselves as factors.
2.Two is the only even prime number. (Though I guess that makes two a very
odd prime number.)
3.There are an infinite number of prime numbers. There is no “largest
prime number”.
4.Multiplying
two prime numbers will give a number that only has two pairs of factors, 1 and
itself (like all numbers do), and the two prime numbers that were multiplied.
For example, 3 and 7 are prime numbers. 3 times 7 is 21. The only factors for
21 are 1, 21, 3, and 7.
First we will create a module that has functions related to
prime numbers:
?isPrime() will return either True or
False if the number passed to it is prime or not. It will use the “divides
test” algorithm.
?primeSieve() will use the “Sieve of
Eratosthenes” algorithm (explained later) to generate numbers.
Like cryptomath.py, the primeSieve.py program is only meant to be imported as a
module to our other programs. It does not do anything when run on its own.
Open a new file editor window by clicking on File ► New Window. Type in
the following code into the file editor, and then save it as primeSieve.py.
code for primeSieve.py
import math
def isPrime(num):
if num & 2:
return False
for i in range(2, int(math.sqrt(num)) + 1):
if num % i == 0:
return False
return True
def primeSieve(sieveSize):
sieve = [True] * sieveSize
sieve[0] = False
sieve[1] = False
for i in range(2, int(math.sqrt(sieveSize)) + 1):
pointer = i * 2
while pointer & sieveSize:
sieve[pointer] = False
pointer += i
primes = []
for i in range(sieveSize):
if sieve[i] == True:
primes.append(i)
return primes
import math
The only module primeSieve.py
needs is the math module.
def isPrime(num):
if num & 2:
return False
We will program the isPrime()
function to return False if the num
parameter is a composite number and True if the num parameter is a prime number. If num
is less than 2 we know it is not prime and can
simply return False.
for i in range(2, int(math.sqrt(num)) + 1):
if num % i == 0:
return False
return True
A prime number has no factors besides 1 and itself. So to
find out if a given number is prime or not, we just have to keep dividing it by
integers and see if any of them evenly divide the number with 0 remainder.
The math.sqrt() function will
return a float value of the square root of the number it is passed. The square
root of a number can be multiplied by itself to get the number. For example,
the square root of 49 is 7, because 7 × 7 is 49.
For example, to find out if 49 is prime, divide it by the
integers starting with 2:
49 ÷ 2 = 24 remainder
49 ÷ 3 = 16 remainder
49 ÷ 4 = 12 remainder
49 ÷ 5 = 9 remainder
49 ÷ 6 = 8 remainder
49 ÷ 7 = 7 remainder
Actually, you only need to divide the number by prime
numbers. For example, there’s no reason to see if 6 divides 49, because if it
did then 2 would have divided 49 since 2 is a factor of 6. Any number that 6
divides evenly can also be divided by 2 evenly:
Figure 23-1. 2 divides 6, and 6 divides X, therefore 2
divides X.
Because there’s an integer (that is not 1 or 49) that evenly
divides 49 (that is, has a remainder of 0), we know that 49 is not a prime
number. For another example, let’s try 13:
13 ÷ 2 = 6 remainder
13 ÷ 3 = 4 remainder
13 ÷ 4 = 3 remainder
No integer divides 13 with a remainder of 0 (except
for 1 and 13, which don’t count). Actually, we don’t even have to divide all the numbers up
to 13. We only have to test the integers up to (and including) the square root
of the number we are testing for primality. The square root of a number is the number
that is multiplied by itself to get that first number. The square root of 25 is
5, because 5 × 5 = 25. The square root of
13 is about 3.9, because 3.9 × 3.9 = 13. This means that when we were testing 13
for primality, we only had to divide 2 and 3 because 4 is larger than the
square root of 13.
Line 18 checks if the remainder of division is 0 by using
the % mod operator. If line 17’s for loop never returns False,
the function will return True on line 20.
The sieve of Eratosthenes (pronounced “era, taws, thuh,
knees”) is an algorithm for calculating prime numbers. Imagine a bunch of boxes
for each integer, all marked “prime”:
Table 23-1. A blank sieve of Eratosthenes, with each
number marked as “prime”.
Mark 1 as “Not Prime” (since one
is never prime). Then mark all the multiples of two (except for two itself) as
“Not Prime”. This means we will mark the integers 4 (2 × 2), 6 (2 × 3), 8 (2 ×
4), 10, 12, and so on up to 50 (the largest number we have) are all marked as
“Not Prime”:
Table 23-2. The sieve with one and the multiples of 2
(except 2 itself) marked as “not prime”.
Then repeat this with all the multiples of three, except for
three itself: 6, 9, 12, 15, 18, 21, and so on are all marked “Not Prime”. Then
do this for all of the multiples of four (except for four itself), and all of
the multiples of five (except for five itself), up until eight. We stop at 8
because it is larger than 7.071, the square root of 50). We can do this because
all the multiples of 9, 10, 11, and so on will already have been marked.
The completed sieve looks like this:
Table 23-3. A completed sieve of Eratosthenes.
By using the sieve of Erastothenes, we’ve calculated that
the prime numbers under 50 are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
43, and 47. This sieve algorithm is good when we want to quickly find out all
the prime numbers in a
certain range of numbers. It is much faster than using the previous
“divides test” algorithm to test if 2 is prime, then test if 3 is prime, then
test if 4 is prime, and so on.
primeSieve()
def primeSieve(sieveSize):
sieve = [True] * sieveSize
sieve[0] = False
sieve[1] = False
The primeSieve() function returns
a list of all prime numbers between 1 and sieveSize. First, line 27 creates a list of Boolean True values that is the length of sieveSize.
The 0 and 1 indexes are
marked as False because 0
and 1 are not prime numbers.
for i in range(2, int(math.sqrt(sieveSize)) + 1):
pointer = i * 2
while pointer & sieveSize:
sieve[pointer] = False
pointer += i
The for loop on line 32 goes
through each integer from 2 up to the square root of
sieveSize. The variable pointer
will start at the first multiple of i after i (which will be i * 2). Then
the while loop will set the pointer
index in the sieve list to False,
and line 36 will change pointer to point to the next
multiple of i.
primes = []
for i in range(sieveSize):
if sieve[i] == True:
primes.append(i)
After the for loop on line 32
completes, the sieve list will contain True for each index that is a prime number. We can create
a new list (which starts as an empty list in primes) and loop over the entire sieve list, and appends and numbers if sieve[i] is True (meaning i is prime).
The list of prime numbers is returned on line 44.
The isPrime() function in primeSieve.py checks if the number can be divided evenly
by a range of numbers from 2 to the square root of the number. But what about a
number like 1,070,595,206,942,983? If you pass this integer to isPrime(), it takes several seconds to determine if it is
prime or not. And if the number is hundreds of digits long (like the prime
numbers in next chapter’s RSA cipher program are), it would take over a
trillion years to figure out if that one
number is prime or not.
The isPrime() function in primeSieve.py is too slow for the large numbers we will
use in the RSA cipher. Fortunately there is an algorithm called the
Rabin-Miller Primality Test than can calculate if such large numbers are prime
or not. We will create a new isPrime() function in rabinMiller.py that makes use of this better algorithm.
The code for this algorithm uses advanced mathematics, and
how the algorithm works is beyond the scope of this book. Like the gcd() function in cryptomath.py,
this book will just present the code for the algorithm for you to use without
explanation.
Open a new file editor window and type in the following
code. Save this file as rabinMiller.py. This
program will be a module that is meant to be imported by other programs.
Instead of typing out the list of numbers on line 43, just
temporarily add the lines import pyperclip and pyperclip.copy(primeSieve(1000)) in the primeSieve.py file and run it. This will copy the list of
primes to the clipboard, and then you can paste it into the rabinMiller.py file.
Open a new file editor window by clicking on File ► New Window. Type in
the following code into the file editor, and then save it as rabinMiller.py.
code for rabinMiller.py
import random
def rabinMiller(num):
s = num - 1
while s % 2 == 0:
s = s // 2
for trials in range(5):
a = random.randrange(2, num - 1)
v = pow(a, s, num)
if v != 1:
while v != (num - 1):
if i == t - 1:
return False
v = (v ** 2) % num
return True
def isPrime(num):
if (num & 2):
return False
= [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157,
163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241,
251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347,
349, 353, 359, 367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433, 439,
443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541, 547,
557, 563, 569, 571, 577, 587, 593, 599, 601, 607, 613, 617, 619, 631, 641, 643,
647, 653, 659, 661, 673, 677, 683, 691, 701, 709, 719, 727, 733, 739, 743, 751,
757, 761, 769, 773, 787, 797, 809, 811, 821, 823, 827, 829, 839, 853, 857, 859,
863, 877, 881, 883, 887, 907, 911, 919, 929, 937, 941, 947, 953, 967, 971, 977,
983, 991, 997]
if num in lowPrimes:
return True
for prime in lowPrimes:
if (num % prime == 0):
return False
return rabinMiller(num)
def generateLargePrime(keysize=1024):
while True:
num = random.randrange(2**(keysize-1), 2**(keysize))
if isPrime(num):
return num
If you run the interactive shell, you can import the rabinMiller.py module and call the functions in it. Try
typing the following into the interactive shell:
import rabinMiller
rabinMiller.generateLargePrime()
rabinMiller.isPrime(48451)
rabinMiller.isPrime(13)
import random
The Rabin-Miller algorithm uses random numbers, so we import
the random module on line 4.
def rabinMiller(num):
s = num - 1
while s % 2 == 0:
s = s // 2
for trials in range(5):
a = random.randrange(2, num - 1)
v = pow(a, s, num)
if v != 1:
while v != (num - 1):
if i == t - 1:
return False
v = (v ** 2) % num
return True
The mathematics of the Rabin-Miller Primality algorithm are
beyond the scope of this book, so the code in this function will not be
explained.
The Rabin-Miller algorithm is also not a surefire test for
however, you can be extremely certain that it is accurate. (Although
this is not good enough for commercial encryption software, it is good enough
for the purposes of the programs in this book.) The main benefit of the
Rabin-Miller algorithm is that it is a relatively simple primality test and
only takes a few seconds to run on a normal computer.
If rabinMiller() returns True, then the num argument is
extremely likely to be prime. If rabinMiller()
returns False, then num
is definitely composite.
def isPrime(num):
if (num & 2):
return False
All numbers that are less than two (such as one, zero, and
negative numbers) are all not prime, so we can immediately return False.
= [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157,
163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241,
251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347,
349, 353, 359, 367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433, 439,
443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541, 547,
557, 563, 569, 571, 577, 587, 593, 599, 601, 607, 613, 617, 619, 631, 641, 643,
647, 653, 659, 661, 673, 677, 683, 691, 701, 709, 719, 727, 733, 739, 743, 751,
757, 761, 769, 773, 787, 797, 809, 811, 821, 823, 827, 829, 839, 853, 857, 859,
863, 877, 881, 883, 887, 907, 911, 919, 929, 937, 941, 947, 953, 967, 971, 977,
983, 991, 997]
if num in lowPrimes:
return True
The numbers in the lowPrimes list
are primes. (Duh.) We can immediately return True if
num is in the lowPrimes
for prime in lowPrimes:
if (num % prime == 0):
return False
Line 49 loops through each of the prime numbers in the lowPrimes list. The integer in num
is modded with the % mod operator by each prime
number on line 50, and if this evaluates to 0 then
we know that prime divides num
and so num is not prime. In that case, line 51
returns False.
Checking if num is divisible by
all the primes under 1000 won’t tell us if the number is prime, but it might
tell us if the number is composite. About 30% of the random numbers that generateLargePrime() creates that are composite will be
detected as composite by dividing by the low prime numbers. Dividing by the low
prime numbers is much faster than executing the full Rabin-Miller algorithm on
the number, so this shortcut can make our program execute much more quickly.
return rabinMiller(num)
Those are all the quick tests to determine if a number is
prime or not. But if num does not match any of those
tests, then it is passed to the rabinMiller()
function to check if it is prime or not. The return value of rabinMiller() will be returned by isPrime().
The comment on line 53 means call the rabinMiller()
function to
determine if the number is prime. Please do not call Dr. Rabin or Dr. Miller
personally to ask them if your number is prime.
def generateLargePrime(keysize=1024):
while True:
num = random.randrange(2**(keysize-1), 2**(keysize))
if isPrime(num):
return num
The generateLargePrime() function
will return an integer that is prime. It does this by coming up with a large
random number, storing it in num, and then passing num to isPrime(). The isPrime() function will then test to see if num is composite and then pass the num
to rabinMiller() for a more thorough (and
computationally expensive) primality test.
If the number num is prime, then
line 62 returns num. Otherwise the infinite loop
goes back to line 60 to try a new random number. This loop keeps trying over
and over again until it finds a number that the isPrime()
function says is prime.
Prime numbers have fascinating properties in mathematics. As
you will learn in the next chapter, they are also the backbone of ciphers used
in actual professional encryption software. The definition of a prime number is
simple enough: a number that only has one and itself as factors. But
determining which numbers are prime and which are composite (that is, not
prime) takes some clever code.
Modding a number with all the numbers from two up to the
square root of the number is how our isPrime()
function determines if that number is prime or not. A prime number will never
have a remainder of 0 when it is modded by any number (besides its factors, 1
and itself.) But this can take a while for the computer to calculate when
testing large numbers for primality.
The sieve of Erastothenes can be used to quickly tell if a
range of numbers is prime or not, but this requires a lot of memory.
The RSA cipher makes use of extremely large prime numbers
that are hundreds of digits long. The Sieve of Erastothenes and the basic isPrime() function we have in primeSieve.py
aren’t sophisticated enough to handle numbers this large.
The Rabin-Miller algorithm uses some mathematics which has
simple code (but the mathematical reasoning behind it is too complex for this
book), but it allows us to determine if a number that is hundreds of digits
long is prime.
In the next chapter, we will use the prime number code we
developed for the rabinMiller.py module in our RSA
Cipher program. At last, we will have a cipher easier to use than the one-time
pad cipher but that cannot be hacked by the simple hacker techniques in this求一张魔兽地图(Divide and conquer VEP 100w的下载地址_百度知道求一张魔兽地图(Divide and conquer VEP 100w)_百度知道Ad: Programming books by Al Sweigart
Topics Covered In This Chapter:
?Dictionaries
?The split() Method
?The None Value
?&Divide by Zero& Errors
?The float(), int(),
and str() Functions and Python 2 Division
?The append() List Method
?Default Arguments
?Calculating Percentage
The gaffer says something longer and more
complicated. After a while, Waterhouse (now wearing his cryptoanalyst hat,
searching for meaning midst apparent randomness, his neural circuits exploiting
the redundancies in the signal) realizes that the man is speaking heavily
accented English.
“Cryptonomicon” by Neal Stephenson
A message encrypted with the transposition cipher can have
thousands of possible keys. Your computer can still easily brute-force this
many keys, but you would then have to look through thousands of decryptions to
find the one correct plaintext. This is a big problem for the brute-force
method of cracking the transposition cipher.
When the computer decrypts a message with the wrong key, the
resulting plaintext is garbage text. We need to program the computer to be able
to recognize if the plaintext is garbage text or English text. That way, if the
computer decrypts with the wrong key, it knows to go on and try the next
possible key. And when the computer tries a key that decrypts to English text,
it can stop and bring that key to the attention of the cryptanalyst. Now the
cryptanalyst won’t have to look through thousands of incorrect decryptions.
It can’t. At least, not in the way that human beings like
you or I understand English. Computers don’t really understand math, chess, or
lethal military androids either, any more than a clock understands lunchtime.
Computers just execute instructions one after another. But these instructions
can mimic very complicated behaviors that solve math problems, win at chess, or
hunt down the future leaders of the human resistance.
Ideally, what we need is a Python function (let’s call it isEnglish()) that has a string passed to it and then returns
True if the string is English text and False if it’s random gibberish. Let’s take a look at some
English text and some garbage text and try to see what patterns the two have:
Robots are your friends. Except for RX-686. She
will try to eat you.
e. xrx ne augur iirl6 Rtiyt fhubE6d hrSei
t8..ow eo.telyoosEs
One thing we can notice is that the English text is made up
of words that you could find in a dictionary, but the garbage text is made up
of words that you won’t. Splitting up the string into individual words is easy.
There is already a Python string method named split()
that will do this for us (this method will be explained later). The split() method just sees when each word begins or ends by
looking for the space characters. Once we have the individual words, we can
test to see if each word is a word in the dictionary with code like this:
if word ==
'aardvark' or word == 'abacus' or word == 'abandon' or word == 'abandoned' or
word == 'abbreviate' or word == 'abbreviation' or word == 'abdomen' or …
We can write code like that, but we probably
shouldn’t. The computer won’t mind running through all this code, but you
wouldn’t want to type it all out. Besides, somebody else has already typed out
a text file full of nearly all English words. These text files are called dictionary
files. So we just need to write a function that checks if the
words in the string exist somewhere in that file.
Remember, a dictionary file is a text file that
contains a large list of English words. A dictionary value is a Python
value that has key-value pairs.
Not every word will exist in our “dictionary file”. Maybe
the dictionary file is incomplete and doesn’t have the word, say, “aardvark”.
There are also perfectly good decryptions that might have non-English words in
them, such as “RX-686” in our above English sentence. (Or maybe the plaintext
is in a different language besides English. But we’ll just assume it is in
English for now.)
And garbage text might just happen to have an English word
or two in it by coincidence. For example, it turns out the word “augur” means a
person who tries to predict the future by studying the way birds are flying.
Seriously.
So our function will not be foolproof. But if most of the
words in the string argument are English words, it is a good bet to say that
the string is English text. It is a very low probability that a ciphertext will
decrypt to English if decrypted with the wrong key.
The dictionary text file will have one word per line in
uppercase. It will look like this:
ABANDONING
ABANDONMENT
…and so on. You can download this entire file (which has
over 45,000 words) from /dictionary.txt.
Our isEnglish() function will
have to split up a decrypted string into words, check if each word is in a file
full of thousands of English words, and if a certain amount of the words are English
words, then we will say that the text is in English. And if the text is in
English, then there’s a good bet that we have decrypted the ciphertext with the
correct key.
And that is how the computer can understand if a string is
English or if it is gibberish.
Practice exercises can be found at /hackingpractice12A.
The detectEnglish.py program
that we write in this chapter isn’t a program that runs by itself. Instead, it
will be imported by our encryption programs so that they can call the detectEnglish.isEnglish() function. This is why we don’t
give detectEnglish.py a main()
function. The other functions in the program are all provided for isEnglish() to call.
Open a new file editor window by clicking on File
► New Window. Type in the following code into the file
editor, and then save it as detectEnglish.py. Press
F5 to run the program.
code for detectEnglish.py
10. UPPERLETTERS
= 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
LETTERS_AND_SPACE = UPPERLETTERS + UPPERLETTERS.lower() + ' \t\n'
def loadDictionary():
dictionaryFile = open('dictionary.txt')
englishWords = {}
for word in dictionaryFile.read().split('\n'):
englishWords[word] = None
dictionaryFile.close()
return englishWords
ENGLISH_WORDS = loadDictionary()
def getEnglishCount(message):
message = message.upper()
message = removeNonLetters(message)
possibleWords = message.split()
if possibleWords == []:
return 0.0
matches = 0
for word in possibleWords:
if word in ENGLISH_WORDS:
matches += 1
return float(matches) / len(possibleWords)
def removeNonLetters(message):
lettersOnly = []
for symbol in message:
if symbol in LETTERS_AND_SPACE:
lettersOnly.append(symbol)
return ''.join(lettersOnly)
def isEnglish(message, wordPercentage=20, letterPercentage=85):
wordsMatch = getEnglishCount(message) * 100 &= wordPercentage
numLetters = len(removeNonLetters(message))
messageLettersPercentage = float(numLetters) / len(message) * 100
lettersMatch = messageLettersPercentage &= letterPercentage
return wordsMatch and lettersMatch
These comments at the top of the file give instructions to
programmers on how to use this module. They give the important reminder that if
there is no file named dictionary.txt in the same
directory as detectEnglish.py then this module will
not work. If the user doesn’t have this file, the comments tell them they can
download it from /dictionary.txt.
10. UPPERLETTERS
= 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
LETTERS_AND_SPACE = UPPERLETTERS + UPPERLETTERS.lower() + ' \t\n'
Lines 10 and 11 set up a few variables that are constants,
which is why they have uppercase names. UPPERLETTERS
is a variable containing the 26 uppercase letters, and LETTERS_AND_SPACE
contain these letters (and the lowercase letters returned from UPPERLETTERS.lower()) but also the space character, the
tab character, and the newline character. The tab and newline characters are
represented with escape characters \t and \n.
def loadDictionary():
dictionaryFile = open('dictionary.txt')
The dictionary file sits on the user’s hard drive, but we
need to load the text in this file as a string value so our Python code can use
it. First, we get a file object by calling open()
and passing the string of the filename 'dictionary.txt'.
Before we continue with the loadDictionary() code,
let’s learn about the dictionary data type.
The dictionary data type has values which can
contain multiple other values, just like lists do. In list values, you use an
integer index value to retrieve items in the list, like spam[42].
For each item in the dictionary value, there is a key used to retrieve it.
(Values stored inside lists and dictionaries are also sometimes called items.)
The key can be an integer or a string value, like spam['hello']
or spam[42]. Dictionaries let us organize our
program’s data with even more flexibility than lists.
Instead of typing square brackets like list values,
dictionary values (or simply, dictionaries) use curly braces. Try typing the
following into the interactive shell:
emptyList = []
emptyDictionary = {}
A dictionary’s values are typed out as key-value pairs,
which are separated by colons. Multiple key-value pairs are separated by
commas. To retrieve values from a dictionary, just use square brackets with the
key in between them (just like indexing with lists). Try typing the following
into the interactive shell:
spam = {'key1':'This is a value', 'key2':42}
spam['key1']
'This is a
spam['key2']
It is important to know that, just as with lists, variables
do not store dictionary values themselves, but references to dictionaries. The
example code below has two variables with references to the same dictionary:
spam = {'hello': 42}
eggs = spam
eggs['hello'] = 99
You can add or change values in a dictionary with indexes as
well. Try typing the following into the interactive shell:
spam = {42:'hello'}
print(spam[42])
spam[42] = 'goodbye'
print(spam[42])
And just like lists can contain other lists, dictionaries
can also contain other dictionaries (or lists). Try typing the following into
the interactive shell:
foo = {'fizz': {'name': 'Al', 'age': 144}, 'moo':['a', 'brown', 'cow']}
foo['fizz']
144, 'name': 'Al'}
foo['fizz']['name']
foo['moo']
'brown', 'cow']
foo['moo'][1]
Practice exercises can be found at /hackingpractice12B.
Function with Dictionaries
The len() function can tell you
how many items are in a list or how many characters are in a string, but it can
also tell you how many items are in a dictionary as well. Try typing the
following into the interactive shell:
spam['name'] = 'Al'
spam['pet'] = 'Zophie the cat'
spam['age'] = 89
Operator with Dictionaries
The in operator can also be used
to see if a certain key value exists in a dictionary. It is important to
remember that the in operator checks if a key exists
in the dictionary, not a value. Try typing the following into the interactive
eggs = {'foo': 'milk', 'bar': 'bread'}
'foo' in eggs
'blah blah blah' in eggs
'milk' in eggs
'bar' in eggs
'bread' in eggs
The not in operator works with
dictionary values as well.
for Loops with
Dictionaries
You can also iterate over the keys in a dictionary with for loops, just like you can iterate over the items in a
list. Try typing the following into the interactive shell:
spam = {'name':'Al', 'age':99}
for k in spam:
print(spam[k])
print('==========')
==========
==========
Practice exercises can be found at /hackingpractice12C.
Dictionaries are like lists in many ways, but there are a
few important differences:
1.Dictionary
items are not in any order. There is no “first” or “last” item in a dictionary
like there is in a list.
Dictionaries
do not have concatenation with the + operator. If
you want to add a new item, you can just use indexing with a new key. For
example, foo['a new key'] = 'a string'
only have integer index values that range from 0 to
the length of the list minus one. But dictionaries can have any key. If you
have a dictionary stored in a variable spam, then
you can store a value in spam[3] without needing
values for spam[0], spam[1],
or spam[2] first.
englishWords = {}
In the loadDictionary() function,
we will store all the words in the “dictionary file” (as in, a file that has
all the words in an English dictionary book) in a dictionary value (as in, the
Python data type.) The similar names are unfortunate, but they are two
completely different things.
We could have also used a list to store the string values of
each word from the dictionary file. The reason we use a dictionary is because
the in operator works faster on dictionaries than
lists. Imagine that we had the following list and dictionary values:
listVal = ['spam', 'eggs', 'bacon']
dictionaryVal = {'spam':0, 'eggs':0, 'bacon':0}
Python can evaluate the expression 'bacon'
in dictionaryVal a little bit faster than 'bacon' in
listVal. The reason is technical and you don’t need to know it for the
purposes of this book (but you can read more about it at /listvsdict).
This faster speed doesn’t make that much of a difference for lists and
dictionaries with only a few items in them like in the above example. But our detectEnglish module will have tens of thousands of items,
and the expression word in ENGLISH_WORDS will be
evaluated many times when the isEnglish() function
is called. The speed difference really adds up for the detectEnglish
split() Method
The split() string method returns
a list of several strings. The “split” between each string occurs wherever a
space is. For an example of how the split() string
method works, try typing this into the shell:
'My very energetic mother
just served us Nutella.'.split()
'very', 'energetic', 'mother', 'just', 'served', 'us', 'Nutella.']
The result is a list of eight strings, one string for each
of the words in the original string. The spaces are dropped from the items in
the list (even if there is more than one space). You can pass an optional
argument to the split() method to tell it to split
on a different string other than just a space. Try typing the following into
the interactive shell:
'helloXXXworldXXXhowXXXareXXyou?'.split('XXX')
'world', 'how', 'areXXyou?']
word in dictionaryFile.read().split('\n'):
Line 16 is a for loop that will
set the word variable to each value in the list dictionaryFile.read().split('\n'). Let’s break this
expression down. dictionaryFile is the variable that
stores the file object of the opened file. The dictionaryFile.read()
method call will read the entire file and return it as a very large string
value. On this string, we will call the split()
method and split on newline characters. This split()
call will return a list value made up of each word in the dictionary file
(because the dictionary file has one word per line.)
This is why the expression dictionaryFile.read().split('\n')
will evaluate to a list of string values. Since the dictionary text file has
one word on each line, the strings in the list that split()
returns will each have one word.
None Value
None is a special value that you
can assign to a variable. The None value represents the lack of a
value. None is the only value of the data type NoneType.
(Just like how the Boolean data type has only two values, the NoneType data
type has only one value, None.) It can be very
useful to use the None value when you need a value
that means “does not exist”. The None value is
always written without quotes and with a capital “N” and lowercase “one”.
For example, say you had a variable named quizAnswer which holds the user's answer to some
True-False pop quiz question. You could set quizAnswer
to None if the user skipped the question and did not
answer it. Using None would be better because if you
set it to True or False
before assigning the value of the user's answer, it may look like the user gave
an answer for the question even though they didn't.
Calls to functions that do not return anything (that is,
they exit by reaching the end of the function and not from a return statement) will evaluate to None.
englishWords[word] = None
In our program, we only use a dictionary for the englishWords variable so that the in
operator can find keys in it. We don’t care what is stored for each key, so we
will just use the None value. The for loop that starts on line 16 will iterate over each
word in the dictionary file, and line 17 will use that word as a key in englishWords with None stored
for that key.
dictionaryFile.close()
return englishWords
After the for loop finishes, the englishWords dictionary will have tens of thousands of
keys in it. At this point, we close the file object since we are done reading
from it and then return englishWords.
ENGLISH_WORDS = loadDictionary()
Line 21 calls loadDictionary()
and stores the dictionary value it returns in a variable named ENGLISH_WORDS. We want to call loadDictionary()
before the rest of the code in the detectEnglish
module, but Python has to execute the def statement
for loadDictionary() before we can call the
function. This is why the assignment for ENGLISH_WORDS
comes after the loadDictionary() function’s code.
def getEnglishCount(message):
message = message.upper()
message = removeNonLetters(message)
possibleWords = message.split()
The getEnglishCount() function
will take one string argument and return a float value indicating the amount of
recognized English words in it. The value 0.0 will
mean none of the words in message are English words
and 1.0 will mean all of the words in message are English words, but most likely getEnglishCount() will return something in between 0.0 and 1.0. The isEnglish() function will use this return value as part of
whether it returns True or False.
First we must create a list of individual word strings from
the string in message. Line 25 will convert it to
uppercase letters. Then line 26 will remove the non-letter characters from the
string, such as numbers and punctuation, by calling removeNonLetters().
(We will see how this function works later.) Finally, the split()
method on line 27 will split up the string into individual words that are
stored in a variable named possibleWords.
So if the string 'Hello there. How are
you?' was passed when getEnglishCount() was
called, the value stored in possibleWords after lines
25 to 27 execute would be ['HELLO', 'THERE', 'HOW', 'ARE',
if possibleWords == []:
return 0.0
If the string in message was
something like '12345', all of these non-letter
characters would have been taken out of the string returned from removeNonLetters(). The call to removeNonLetters()
would return the blank string, and when split() is
called on the blank string, it will return an empty list.
Line 29 does a special check for this case, and returns 0.0. This is done to avoid a “divide-by-zero” error (which
is explained later on).
matches = 0
for word in possibleWords:
if word in ENGLISH_WORDS:
matches += 1
The float value that is returned from getEnglishCount()
ranges between 0.0 and 1.0.
To produce this number, we will divide the number of the words in possibleWords that are recognized as English by the total
number of words in possibleWords.
The first part of this is to count the number of recognized
English words in possibleWords, which is done on
lines 32 to 35. The matches variable starts off as 0. The for loop on line 33 will
loop over each of the words in possibleWords, and checks
if the word exists in the ENGLISH_WORDS dictionary.
If it does, the value in matches is incremented on
Once the for loop has completed,
the number of English words is stored in the matches
variable. Note that technically this is only the number of words that are
recognized as English because they existed in our dictionary text file. As far
as the program is concerned, if the word exists in dictionary.txt,
then it is a real English word. And if it doesn’t exist in the dictionary file,
it is not an English word. We are relying on the dictionary file to be accurate
and complete in order for the detectEnglish module
to work correctly.
return float(matches)
/ len(possibleWords)
Returning a float value between 0.0
and 1.0 is a simple matter of dividing the number of
recognized words by the total number of words.
However, whenever we divide numbers using the / operator in Python, we should be careful not to cause a
“divide-by-zero” error. In mathematics, dividing by zero has no meaning. If we
try to get Python to do it, it will result in an error. Try typing the
following into the interactive shell:
(most recent call last):
&&pyshell#0&&, line 1, in &module&
ZeroDivisionError:
int division or modulo by zero
But a divide by zero can’t possibly happen on line 36. The
only way it could is if len(possibleWords) evaluated
to 0. And the only way that would be possible is if possibleWords were the empty list. However, our code on
lines 29 and 30 specifically checks for this case and returns 0.0. So if possibleWords had
been set to the empty list, the program execution would have never gotten past
line 30 and line 36 would not cause a “divide-by-zero” error.
float(), int(), and str() Functions and
Integer Division
float(matches) / len(possibleWords)
The value stored in matches is an
integer. However, we pass this integer to the float()
function which returns a float version of that number. Try typing the following
into the interactive shell:
The int() function returns an
integer version of its argument, and the str()
function returns a string. Try typing the following into the interactive shell:
The float(), int(),
and str() functions are helpful if you need a
value’s equivalent in a different data type. But you might be wondering why we
pass matches to float()
on line 36 in the first place.
The reason is to make our detectEnglish
module work with Python 2. Python 2 will do integer division when both values
in the division operation are integers. This means that the result will be
rounded down. So using Python 2, 22 / 7 will
evaluate to 3. However, if one of the values is a
float, Python 2 will do regular division: 22.0 / 7
will evaluate to 3.143. This is why line
36 calls float(). This is called making the code backwards
compatible with previous versions.
Python 3 always does regular division no matter if the
values are floats or ints.
Practice exercises can be found at /hackingpractice12D.
def removeNonLetters(message):
lettersOnly = []
for symbol in message:
The previously explained getEnglishCount()
function calls the removeNonLetters() function to
return a string that is the passed argument, except with all the numbers and
punctuation characters removed.
The code in removeNonLetters() starts
with a blank list and loops over each character in the message
argument. If the character exists in the LETTERS_AND_SPACE
string, then it is added to the end of the list. If the character is a number
or punctuation mark, then it won’t exist in the LETTERS_AND_SPACE
string and won’t be added to the list.
List Method
if symbol in LETTERS_AND_SPACE:
lettersOnly.append(symbol)
Line 42 checks if symbol (which
is set to a single character on each iteration of line 41’s for loop) exists in the LETTERS_AND_SPACE
string. If it does, then it is added to the end of the lettersOnly
list with the append() list method.
If you want to add a single value to the end of a list, you
could put the value in its own list and then use list concatenation to add it.
Try typing the following into the interactive shell, where the value 42 is added to the end of the list stored in spam:
spam = [2, 3, 5, 7, 9, 11]
spam = spam + [42]
7, 9, 11, 42]
When we add a value to the end of a list, we say we are appending
the value to the list. This is done with lists so frequently in Python that
there is an append() list method which takes a
single argument to append to the end of the list. Try typing the following into
the shell:
eggs.append('hovercraft')
['hovercraft']
eggs.append('eels')
['hovercraft',
eggs.append(42)
['hovercraft',
'eels', 42]
For technical reasons, using the append()
method is faster than putting a value in a list and adding it with the + operator. The append() method
modifies the list in-place to include the new value. You should always prefer
the append() method for adding values to the end of
''.join(lettersOnly)
After line 41’s for loop is done,
only the letter and space characters are in the lettersOnly
list. To make a single string value from this list of strings, we call the join() string method on a blank string. This will join the
strings in lettersOnly together with a blank string
(that is, nothing) between them. This string value is then returned as removeNonLetters()’s return value.
def isEnglish(message, wordPercentage=20, letterPercentage=85):
The isEnglish() function will
accept a string argument and return a Boolean value that indicates whether or
not it is English text. But when you look at line 47, you can see it has three
parameters. The second and third parameters (wordPercentage
and letterPercentage) have equal signs and values
next to them. These are called default arguments. Parameters that have
default arguments are optional. If the function call does not pass an argument
for these parameters, the default argument is used by default.
If isEnglish() is called with
only one argument, the default arguments are used for the wordPercentage
(the integer 20) and letterPercentage
(the integer 85) parameters. Table 12-1 shows
function calls to isEnglish(), and what they are
equivalent to:
Table 12-1. Function calls with and without default
arguments.
isEnglish('Hello')
isEnglish('Hello', 20, 85)
isEnglish('Hello', 50)
isEnglish('Hello', 50, 85)
isEnglish('Hello', 50, 60)
isEnglish('Hello', 50, 60)
isEnglish('Hello',
letterPercentage=60)
isEnglish('Hello', 20, 60)
When isEnglish() is called with
no second and third argument, the function will require that 20% of the words
in message are English words that exist in the
dictionary text file and 85% of the characters in message
are letters. These percentages work for detecting English in most cases. But
sometimes a program calling isEnglish() will want looser
or more restrictive thresholds. If so, a program can just pass arguments for wordPercentage and letterPercentage
instead of using the default arguments.
A percentage is a number between 0 and 100 that shows how
much of something there is proportional to the total number of those things. In
the string value 'Hello cat MOOSE fsdkl ewpin' there
are five “words” but only three of them are English words. To calculate
the percentage of English words, you divide the number of English words by the
total number of words and multiply by 100. The percentage of English
words in 'Hello cat MOOSE fsdkl ewpin' is 3 / 5 * 100, which is 60.
Table 12-2 shows some percentage calculations:
Table 12-2. Some percentage calculations.
The percentage will always be between 0% (meaning none of
the words) and 100% (meaning all of the words). Our isEnglish()
function will consider a string to be English if at least 20% of the words are
English words that exist in the dictionary file and 85% of the characters in
the string are letters (or spaces).
wordsMatch
= getEnglishCount(message) * 100 &= wordPercentage
Line 51 calculates the percentage of recognized English
words in message by passing message
to getEnglishCount(), which does the division for us
and returns a float between 0.0 and 1.0. To get a percentage from this float, we just have to
multiply it by 100. If this number is greater than
or equal to the wordPercentage parameter, then True is stored in wordsMatch.
(Remember, the &= comparison operator evaluates
expressions to a Boolean value.) Otherwise, False is
stored in wordsMatch.
numLetters = len(removeNonLetters(message))
messageLettersPercentage = float(numLetters) / len(message) * 100
lettersMatch = messageLettersPercentage &= letterPercentage
Lines 52 to 54 calculate the percentage of letter characters
in the message string. To determine the percentage
of letter (and space) characters in message, our
code must divide the number of letter characters by the total number of
characters in message. Line 52 calls removeNonLetters(message). This call will return a string
that has the number and punctuation characters removed from the string. Passing
this string to len() will return the number of
letter and space characters that were in message.
This integer is stored in the numLetters variable.
Line 53 determines the percentage of letters getting a float
version of the integer in numLetters and dividing
this by len(message). The return value of len(message) will be the total number of characters in message. (The call to float()
was made so that if the programmer who imports our detectEnglish
module is running Python 2, the division done on line 53 will always be regular
division instead of integer division.)
Line 54 checks if the percentage in messageLettersPercentage
is greater than or equal to the letterPercentage
parameter. This expression evaluates to a Boolean value that is stored in lettersMatch.
return wordsMatch
and lettersMatch
We want isEnglish() to return True only if both the wordsMatch
and lettersMatch variables contain True, so we put them in an expression with the and operator. If both the wordsMatch
and lettersMatch variables are True,
then isEnglish() will declare that the message
argument is English and return True. Otherwise, isEnglish() will return False.
Practice exercises can be found at /hackingpractice12E.
The dictionary data type is useful because like a list it
can contain multiple values. However unlike the list, we can index values in it
with string values instead of only integers. Most of the the things we can do
with lists we can also do with dictionaries, such as pass it to len() or use the in and not in operators on it. In fact, using the in operator on a very large dictionary value executes much
faster than using in on a very large list.
The NoneType data type is also a new data type introduced in
this chapter. It only has one value: None. This
value is very useful for representing a lack of a value.
We can convert values to other data types by using the int(), float(), and str() functions. This chapter brings up “divide-by-zero”
errors, which we need to add code to check for and avoid. The split() string method can convert a single string value
into a list value of many strings. The split()
string method is sort of the reverse of the join()
list method. The append() list method adds a value
to the end of the list.
When we define functions, we can give some of the parameters
“default arguments”. If no argument is passed for these parameters when the
function is called, the default argument value is used instead. This can be a
useful shortcut in our programs.
The transposition cipher is an improvement over the Caesar
cipher because it can have hundreds or thousands of possible keys for messages
instead of just 26 different keys. A computer has no problem decrypting a
message with thousands of different keys, but to hack this cipher, we need to
write code that can determine if a string value is valid English or not.
Since this code will probably be useful in our other hacking
programs, we will put it in its own module so it can be imported by any program
that wants to call its isEnglish() function. All of
the work we’ve done in this chapter is so that any program can do the
following:
import detectEnglish
detectEnglish.isEnglish('Is this sentence English text?')
Now armed with code that can detect English, let’s move on
to the next chapter and hack the transposition cipher!

参考资料

 

随机推荐