In [1]:
options(repr.plot.width=4, repr.plot.height=4)
library(igraph)
library(VGAM)
library(ggplot2)
Attaching package: ‘igraph’

The following objects are masked from ‘package:stats’:

    decompose, spectrum

The following object is masked from ‘package:base’:

    union

Loading required package: stats4
Loading required package: splines
Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
In [10]:
#text <- tolower(paste(readLines('gutenberg/chesterton-brown.txt',
#                                encoding='UTF-8'), collapse = " "))
                                  
text <- tolower(paste(readLines('lutherbibel.txt',
                                encoding='UTF-8'), collapse = " "))

words <- unlist(strsplit(text, "\\s+"))
words <- words[words != '']

word.frequencies <- data.frame(table(words))
word.frequencies <- word.frequencies[order(word.frequencies$Freq, decreasing=T),]
n <- nrow(word.frequencies)
rownames(word.frequencies) <- 1:n
word.frequencies$rank <- 1:n


word.frequencies[1:30,]

plot(Freq~rank, word.frequencies,
     log='xy')
A data.frame: 30 × 3
wordsFreqrank
<fct><int><int>
und 46383 1
der 18736 2
die 16583 3
zu 10083 4
sie 9772 5
den 8753 6
das 8685 7
er 8535 8
dass 8052 9
ich 796710
nicht 756211
in 703212
des 698913
dem 688514
ist 641715
aber 597016
von 595117
da 545518
mit 544519
auf 543920
du 532321
denn 510022
ein 492223
ihr 459924
es 443325
so 441426
herr 432027
an 389028
herrn 369729
wie 336130
In [11]:
tail(word.frequencies, 20)
A data.frame: 20 × 3
wordsFreqrank
<fct><int><int>
23663zweitausend) 123663
23664zweiten: 123664
23665zweitenmal 123665
23666zweiunddreißigste 123666
23667zweiunddreißigsten 123667
23668zweiundfünfzigsten 123668
23669zweiundfünfzigtausend123669
23670zweiundreißig 123670
23671zweiundsiebzigtausend123671
23672zweizüngig 123672
23673zwiebeln 123673
23674zwiefältiger 123674
23675zwinge 123675
23676zwinger 123676
23677zwingst 123677
23678zwölfe 123678
23679zwölfen: 123679
23680zwölfmal 123680
23681zypressen 123681
23682zypressenholz 123682
In [19]:
x <- round(exp(rnorm(10000)))
In [26]:
plot(sort(as.numeric(table(x)), decreasing=T), log='xy')

ggplot(data.frame(x=x), aes(x=x, y=..count..)) + 
    geom_histogram() +
    scale_x_log10() + scale_y_log10()
Warning message:
“Transformation introduced infinite values in continuous x-axis”`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
“Removed 2502 rows containing non-finite values (stat_bin).”Warning message:
“Transformation introduced infinite values in continuous y-axis”Warning message:
“Removed 11 rows containing missing values (geom_bar).”
In [24]:
ggplot(word.frequencies, aes(x=Freq,y=..count..)) +
    geom_histogram(fill='white', col='black') +
    scale_x_log10() +
    scale_y_log10()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
“Transformation introduced infinite values in continuous y-axis”Warning message:
“Removed 2 rows containing missing values (geom_bar).”
In [25]:
d <- read.csv('staedte.csv')

plot(sort(d$Bevölkerung, decreasing=T), log='xy')


ggplot(d, aes(x=Bevölkerung, y=..count..)) +
    geom_histogram(fill='white', col='black') +
    scale_x_log10() +
    scale_y_log10()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
“Transformation introduced infinite values in continuous y-axis”Warning message:
“Removed 3 rows containing missing values (geom_bar).”
In [27]:
(fit <- fit_power_law(d$Bevölkerung))
$continuous
FALSE
$alpha
2.3087305834986
$xmin
21104
$logLik
-7134.4410121721
$KS.stat
0.0231888102724385
$KS.p
0.891057605619371
In [34]:
alpha <- 2
xmin <- 10

x <- xmin*(runif(100000)^(1/(1-alpha)))
In [35]:
ggplot(data.frame(x=x), aes(x=x, y=..count..))+
geom_histogram() +
scale_x_log10() + scale_y_log10()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
“Transformation introduced infinite values in continuous y-axis”Warning message:
“Removed 3 rows containing missing values (geom_bar).”
In [36]:
plot(sort(x, decreasing=T), log='xy')
In [40]:
plot(zeta(2, shift=1:100))
In [41]:
simulatePL <- function(alpha, xmin) {
    g <- function(x) {
        (x^{1-alpha} - (x+1)^(1-alpha))/xmin^(1-alpha)
    }
    f <- function(x) {
        (x^-alpha)/zeta(alpha, shift=xmin)
    }
    C <- f(xmin)/g(xmin)
    weiter <- TRUE
    while (weiter) {
        x <- floor(xmin*runif(1)^(1/(1-alpha)))
        if (runif(1) < f(x)/(C*g(x))) {
            weiter <- FALSE
        }
    }
    return(x)
}
In [42]:
x <- replicate(10000, simulatePL(2, 1))
In [44]:
ggplot(data.frame(x=x), aes(x=x)) + geom_histogram() +
scale_x_log10() + scale_y_log10()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
“Transformation introduced infinite values in continuous y-axis”Warning message:
“Removed 6 rows containing missing values (geom_bar).”
In [45]:
fit_power_law(x)
$continuous
FALSE
$alpha
2.01084226712964
$xmin
1
$logLik
-16185.948699514
$KS.stat
0.00743602627438023
$KS.p
0.63794481606571
In [ ]: