liyou969/2420-cheat-sheet

COMP2420/COMP6420 - Introduction to Data Management, Analysis and Security

Assignment 1 - 2023

Discussion

Use the assignment_1 folder in Piazza discussions. Check to see if your question has already been answered before starting a new topic.

import

import numpy as np

import pandas as pd

from scipy import stats

from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt

import seaborn as sns

output

df.to_excel('output.xlsx', index=False, engine='openpyxl')

boxplot

df = pd.read_excel('a.xlsx')

df['a'] = df['Letters'].where(df['Letters']=='a','notA')

grouped = df.groupby('a')['Numbers'].apply(list).to_dict()

fig = plt.figure(figsize=(22,14))

plt.boxplot([grouped['a'], grouped['notA']])

plt.xticks([1,2],["a", "Not A"])

plt.title("Boxplot of a and not_a")

plt.ylabel("Number of days stayed in hospital")

plt.xlabel("If they use something")

plt.show()
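To see what the `where` and `groupby` steps above hand to `plt.boxplot`, here is a minimal sketch on a made-up frame. The column names `Letters` and `Numbers` follow the snippet above; the values are invented purely for illustration.

```python
import pandas as pd

# Made-up data; only the column names come from the snippet above.
df = pd.DataFrame({
    "Letters": ["a", "b", "a", "c"],
    "Numbers": [1, 2, 3, 4],
})

# Keep 'a' where the condition holds; replace every other value with 'notA'.
df["a"] = df["Letters"].where(df["Letters"] == "a", "notA")

# Collect the Numbers of each group into a plain list, keyed by group label.
grouped = df.groupby("a")["Numbers"].apply(list).to_dict()

print(grouped)
```

Each dictionary value is one list of numbers, which is the shape `plt.boxplot` expects when it draws one box per group.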

from scipy.stats import ttest_1samp

t,p = stats.ttest_1samp(SeaIce["Sum_Arctic"], 10)

print("The result of one sample test's t_value is",t)

print("The result of one sample test's p_value is",p)

t1,p1 = stats.ttest_ind(SeaIce["Sum_Antarctic"],SeaIce["Sum_Arctic"])

print("The result of two sample test's t value is", t1)

print("The result of two sample test's p value is", p1)

#correlation

from scipy.stats import pearsonr

corr, p_value = pearsonr(data[""],data[""])

print("The correlation is ",corr)

print("The p_value is ",p_value)

print("R square: ", corr * corr)

p_value: This is the p-value associated with that correlation. The p-value gives the probability of observing the current data, or something more extreme, when there's no actual correlation present. Typically:

If the p-value is less than a predetermined significance level (e.g., 0.05), we might reject the null hypothesis of "no correlation between the two variables," believing the relationship between them to be statistically significant. If the p-value is larger, we do not reject the null hypothesis, implying that we don't have sufficient evidence to suggest a statistically significant relationship between the variables.
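The decision rule above can be sketched end to end. The data here are made up (a nearly linear relationship), and `alpha` and `significant` are names chosen for this example, not part of the original notes.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up, nearly linear data (y is roughly 2 * x).
x = np.arange(1, 11, dtype=float)
y = 2 * x + np.array([0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.1, -0.5])

alpha = 0.05  # significance level, fixed before looking at the data

corr, p_value = pearsonr(x, y)
significant = bool(p_value < alpha)

if significant:
    print(f"r = {corr:.3f}, p = {p_value:.3g}: reject the null hypothesis of no correlation")
else:
    print(f"r = {corr:.3f}, p = {p_value:.3g}: not enough evidence of a correlation")
```

Note that a significant p-value only says the correlation is unlikely to be zero; the size of `corr` says how strong the linear relationship is.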

#T-test

'''

If the populations are normally distributed (or nearly so) and you want to compare the mean of one population with the mean of another, a t-test can be used (cf. the nonparametric Wilcoxon test).

Null Hypothesis: The means of both populations are equal.

Alternate Hypothesis: The means of both populations are not equal.

A t-score that is large in absolute value is evidence that the group means differ.

A t-score close to zero means the data are consistent with equal group means.

'''

ttest_ind performs a hypothesis test on the means of two independent groups of scores, e.g. testing the claim that the average marks in two similar courses are the same

t, p = stats.ttest_ind(p24_300mg, p24_600mg)

ttest_rel performs a hypothesis test on the means of two related (paired) groups of scores, e.g. testing the claim that a particular student's average marks in two different courses are the same

t, p = stats.ttest_rel(p24_300mg, p24_600mg)

ttest_1samp tests whether the mean of a single sample equals a given value (here 76)

stats.ttest_1samp(data[''],76)
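As a rough illustration of how the three tests differ, the sketch below runs each on the same made-up scores. The arrays `scores_a` and `scores_b` and the reference mean of 20.0 are assumptions for this example only; they stand in for data such as `p24_300mg` and `p24_600mg` above.

```python
import numpy as np
from scipy import stats

# Made-up scores; in the paired test, position i in both arrays is the same subject.
scores_a = np.array([20.1, 19.8, 21.2, 20.5, 19.9, 20.7, 20.3, 20.0])
scores_b = np.array([25.2, 24.7, 26.1, 25.5, 24.9, 25.8, 25.3, 25.0])

# Independent samples: two unrelated groups.
t_ind, p_ind = stats.ttest_ind(scores_a, scores_b)

# Related samples: the same subjects measured twice, so the arrays must be the same length.
t_rel, p_rel = stats.ttest_rel(scores_a, scores_b)

# One sample: compare the mean of one group with a fixed reference value.
t_one, p_one = stats.ttest_1samp(scores_a, 20.0)

print("independent:", t_ind, p_ind)
print("related:    ", t_rel, p_rel)
print("one sample: ", t_one, p_one)
```

The sign of the t-score follows the argument order: it is negative here because the first group's mean is lower than the second's.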

p value

The p-value is below 0.05, so reject the null hypothesis: the means of both ... are not equal

The p-value is larger than 0.05, so the result lies outside the rejection region;

thus we cannot reject the null hypothesis: there is not enough evidence that the mean of ... differs between ... and ...
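The two wordings above can be wrapped in a small helper so every test is reported the same way. `interpret_p_value` is a hypothetical function written for this sketch, and the 0.05 default is simply the level used in these notes.

```python
def interpret_p_value(p, alpha=0.05):
    """Return a one-line decision for a hypothesis test at significance level alpha."""
    if p < alpha:
        return f"p = {p:.3g} < {alpha}: reject the null hypothesis"
    return f"p = {p:.3g} >= {alpha}: fail to reject the null hypothesis"


print(interpret_p_value(0.012))
print(interpret_p_value(0.31))
```

Failing to reject is not proof that the means are equal; it only means the data do not provide enough evidence of a difference.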

sns

Line Plot: trends and relationships of continuous variables, such as time series data or variables changing with a parameter.

Scatter Plot: relationship between two continuous variables, helping to observe correlations or distributions between variables.

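Bar Plot: plt.bar() is suited to comparing discrete data across categories or groups, whereas plt.hist() is suited to showing the distribution of a continuous variable.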

Histogram: displays the distribution of numerical data, helping to understand its central tendency and dispersion.

Box Plot: displays the distribution of numerical data through the median, quartiles, and outliers, making outliers and the shape of the distribution easy to spot.

Heatmap: shows a matrix of values as colors, e.g. frequencies across two categorical variables or a correlation matrix, with color representing the degree of association or frequency.

Violin Plot: Combines the features of a box plot and a kernel density plot, used to display the distribution and density of numerical data.

Categorical Plot: Includes bar plots, count plots, box plots, etc., used to display data distribution and relationships between different categories.

'''

#hue for different lines

sns.lineplot(x = '', y='', hue = '' , data=q3q4_df)

sns.lineplot(x=x, y=y)

plt.title('')

plt.show()

sns.scatterplot(x=x, y=y)

sns.barplot(x=x, y=y)

sns.histplot(data)

sns.boxplot(data=data)

sns.heatmap(data, cmap='YlGnBu', annot=True, fmt='.2f')

sns.violinplot(x=x, y=y,hue='')
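To make the bar plot versus histogram distinction from the list above concrete, here is a minimal matplotlib sketch. All data are made up, and the `Agg` backend is chosen only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up data for illustration.
categories = ["A", "B", "C"]
counts = [12, 7, 9]                          # discrete: one value per category
values = np.linspace(0.0, 10.0, 50) ** 0.5   # continuous measurements

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot: one bar per category, height given directly by the data.
bars = ax_bar.bar(categories, counts)
ax_bar.set_title("Bar plot: compare categories")

# Histogram: the bins are computed from the data, and bar height is the count per bin.
bin_counts, bin_edges, _ = ax_hist.hist(values, bins=5)
ax_hist.set_title("Histogram: distribution of a continuous variable")

fig.tight_layout()
plt.close(fig)
```

The key difference is who decides the x-axis: for a bar plot you supply the categories, while for a histogram the bin edges are derived from the values.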

date

#start of week2

start='2023-02-27'

#end of week2

end='2023-03-03'

#create a new column in dataframe to record the date of each row

SeaIce["Date"] = pd.to_datetime(SeaIce[["Year","Month","Day"]])

is_date = (SeaIce['Date'] >= start) & (SeaIce["Date"] <= end)
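The mask `is_date` is a boolean Series; it still has to be applied to select rows. A minimal sketch, assuming a made-up frame with the same `Year`/`Month`/`Day` layout as `SeaIce` (the names `frame` and `week2` are hypothetical):

```python
import pandas as pd

# Made-up frame covering a few days on either side of the target week.
dates = pd.date_range("2023-02-25", "2023-03-05", freq="D")
frame = pd.DataFrame({"Year": dates.year, "Month": dates.month, "Day": dates.day})

start = "2023-02-27"
end = "2023-03-03"

# Assemble a datetime column from the three integer columns.
frame["Date"] = pd.to_datetime(frame[["Year", "Month", "Day"]])

# Both comparisons are inclusive, so the start and end days are kept.
is_date = (frame["Date"] >= start) & (frame["Date"] <= end)

# Boolean indexing keeps only the rows where the mask is True.
week2 = frame.loc[is_date]
print(week2)
```

The parentheses around each comparison are required, because `&` binds more tightly than `>=` and `<=` in Python.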

plot more than one graph in one pic

x = SeaIce["Date"]

y = SeaIce["Extent(Antarctic)"]

z = SeaIce["Extent(Arctic)"]

plt.plot(x,y,label="Antarctic")

plt.plot(x,z,label="Arctic")

plt.xlabel("Date")

plt.ylabel("Sea Ice extents(10^6 sq km)")

plt.title("Daily trend of the Antarctic and Arctic sea ice extents")

plt.legend()

plt.show()
