Data analysis and data mining

A. Azzalini - B. Scarpa

DATA SETS USED IN THE EXAMPLES

Simulated data
Some of the data used were obtained by means of simulation of pseudo-random numbers, as follows. All files are in text format.
- `Yesterday's data and tomorrow's data'. A table with 30 rows (excluding the row of variable names) and 3 columns, containing the variables x, y.yesterday and y.tomorrow, with self-explanatory names. These data are used in Chapters 3 and 4, and in section 4.8.
- Data for three classes, of sizes 120, 80, and 100, used in chapter 5. The data table contains 300 rows (excluding the row of variable names) and 3 columns, for two explanatory variables, z₁ and z₂, and one class indicator. Some of the numerical examples in chapter 5 refer to data in the first two classes.
- C1 and C2. Two data collections used in section 6.1, each with two variables, with 250 and 100 points respectively.
Car data

The car data (text format, 23719 bytes) were obtained by simple manipulation of original data on the characteristics of 203 automobile models imported into the USA in 1985. The original data are available at:
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/autos

and their manipulation on our part simply consisted of converting one unit of measurement to another, and eliminating some variables. The new variables are described in the file: auto.names
Brazilian bank data

The data for the Brazilian bank (text format/csv, 50019 bytes) were obtained by simple manipulation of original data from a customer satisfaction survey conducted by a Brazilian bank. For 500 subjects, randomly selected from the bank's customers, some information from marketing research was obtained.

The variables are described in the file: brazil.names.
Data for telephone company customers

The data for telephone customers (zip format, 4304617 bytes) were obtained by simple manipulation of original data on the characteristics of 30,619 customers of a European telephone company with post-pay contracts. Data are organized in two zip files, already randomly divided into training and test sets.
To be part of the set, the customers had to be active in the 10 consecutive months to which the data refer, which are conventionally indicated by numbers from 1 to 10 ($nn=01,\dots,10$).

The original data were processed simply by eliminating some of the original variables. For the customers, the available variables are described in the file telekom.names
Insurance data
The data on insurance customers (zip format, 106163 bytes) were obtained by simple manipulation of original data on the characteristics of a sample of 5,000 customers of a European insurance company.
To be part of the set, the customers had to take out one policy in at least one of the company's lines of business.
Processing of the original data consisted simply of eliminating some of the original variables. For these customers, the available variables are described in the file claims.names.
Choice of fruit juice data

The data on fruit juice purchases (text format, 91580 bytes) were presented in Chapter 11 of Foster, Stine and Waterman, "Business Analysis Using Regression" (published by Springer-Verlag, 1998), and are available through StatLib, the distribution system for statistical information, by following the path Main >> Data Archive >> Datasets Archive >> business.

The data refer to 1070 fruit juice purchases of two different brands (MM and CH) in certain US supermarkets, supplied with some contributory variables. The data used in Chapter 5 were slightly processed, in the sense that some characteristics of little importance were excluded.
The variables used are described in the file juice.names.

The variable loyaltyMM starts from the value 0.5 and is updated with every purchase by the same customer: if the customer chose MM, it increases by 20% of the difference between 1 and its current value; if the customer chose CH, it falls by 20% of the difference between its current value and 0. The corresponding variable loyaltyCH is given by $1 - loyaltyMM$.
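The updating rule for loyaltyMM can be illustrated with a minimal Python sketch. It is not the code used to build the data set: the function name `loyalty_mm_series`, its parameters, and the choice to record the value after each purchase (rather than before it) are assumptions made here for illustration only.

```python
def loyalty_mm_series(choices, start=0.5, rate=0.2):
    """Return the loyaltyMM value after each purchase of one customer.

    `choices` is the customer's sequence of purchased brands ("MM" or "CH").
    Hypothetical helper: the name, the signature and the convention of
    recording the value after the update are assumptions, not taken from
    the data set documentation.
    """
    value = start
    history = []
    for brand in choices:
        if brand == "MM":
            # Move up by 20% of the remaining distance to 1.
            value += rate * (1.0 - value)
        elif brand == "CH":
            # Move down by 20% of the distance to 0.
            value -= rate * value
        else:
            raise ValueError(f"unknown brand: {brand!r}")
        history.append(value)
    return history


if __name__ == "__main__":
    mm = loyalty_mm_series(["MM", "MM", "CH"])
    ch = [1.0 - v for v in mm]  # loyaltyCH = 1 - loyaltyMM
    print(mm, ch)
```

Under these assumptions, the value always stays strictly between 0 and 1, and a long run of purchases of the same brand moves it geometrically towards 1 (for MM) or 0 (for CH).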

There are five stores in question, numbered from 0 to 4.
Customer satisfaction data

The data on customer satisfaction (zip format, 128703 bytes) were obtained by simple manipulation of original data on responses to a questionnaire from a survey of 4,000 customers of a European IT company producing and selling software and offering consulting services. The survey was carried out by an independent marketing research company specializing in such surveys.
Processing of the original data consisted simply of eliminating some of the original variables.
The variables used are described in the file customer.names.
Web usage data
The data on web usage (zip format, 684157 bytes) refer to hits made by 26,157 anonymous visitors to the website of a consulting company. Data were collected from web log files, which record all relevant information about hits on every page of the website.
A user session describes the sequence of web pages viewed consecutively by a visitor, without leaving the website or the connection. We call these sequences of pages `visits'. The website to which the data refer does not have a `cookie system' or other way of identifying the same visitor in different sessions, so we treat each session in the analysis as a new visitor, and we use the terms `session' and `visit' interchangeably.
All pages visited in a year are included in the dataset. Sessions are labeled with an identification number, and no personal information is available. The website has 215 pages and the total number of page views (hits) on the entire site was 47,387. Some of the pages have similar contents and were aggregated into 8 categories (home, contacts, communications, events, company, white papers, business units, consulting). The day and time of all visits to every single page are also recorded in the dataset. For each single event (visit to a page) the available variables are described in the file webdata.names.
